May 29, 2023, 04:12:00

Extracting a particular URL from a large text file using Python

Question

I am trying to extract a particular URL from a large text file.

Data (or text file):

[{"profile":"164","width":638,"height":360,"mime":"video/mp4","fps":30,"url":"https://vod-progressive.akamaized.net/exp=1685291596~acl=%2Fvimeo-transcode-storage-prod-us-central1-h264-360p2F3314456635875455007.mp4~hmac=938482df34b6c94756876549908053d738d426ca/vimeo-transcode-storage-prod-us-central1-h264-360p/01/2982/28/714910924/3314655007.mp4","cdn":"akamai_interconnect","quality":"360p","id":"a81ee7e1-4ae0-4b3a-85fe-7d0d8ff16b93","origin":"gcs"},{"profile":"165","width":958,"height":540,"mime":"video/mp4","fps":30,"url":"https://vod-progressive.akamaized.net/exp=1685291596~acl=%2Fvimeo-transcode-storage-prod-us-west1-h264-540p%2F01%2F2982%2F28%2F714910924%2F3314655137.mp4~hmac=938482df31237894b6c947e9908053d738d4dd26ca/vimeo-transcode-storage-prod-us-west1-h264-540p/01/2982/28/714910924/3314655137.mp4","cdn":"akamai_interconnect","quality":"540p","id":"08b0edbf-ce47-4dd1-aa17-3c14ed3eccc8","origin":"gcs"},{"profile":"174","width":1278,"height":720,"mime":"video/mp4","fps":30,"url":"https://vod-progressive.akamaized.net/exp=1685291596~acl=%2Fvimeo-transcode-storage-prod-us-east1-h264-720p%2F01%FFF2F2F22F3314655135.mp4~hmac=c44e0c048f008ce4996434ce5d6543234567189dbc63/vimeo-transcode-storage-prod-us-east1-h264-720p/01/2982/28/714910924/3314655135.mp4","cdn":"akamai_interconnect","quality":"720p","id":"625db5ed-2175-4977-a562-de40d84aab45","origin":"gcs"}]},"file_codecs":{"av1":[],"avc":["a81ee7e1-4ae0-4b3a-85fe-7d0d8ff16b93",
"d5589a0c-a69a-4428-ad7e-0c2a1e4e9f92","08b0edbf-ce47-4dd1-aa17-3c14ed3eccc8","625db5ed-2175-4977-a562-de40d84aab45"],"hevc":{"dvh1":[],"hdr":[],"sdr":[]}},"lang":"en","referrer"

Here's the same text, which can be viewed using an online text viewer, and here's an image of the part that I am trying to extract:

Description of the problem:

There are several links in the text document, including some that look similar (but not identical) to the one I am trying to extract. The only link I need from the txt file contains 720p somewhere in it.
The URL looks like "https://vod-progressive.akamaized.net [..] .mp4".
Note that the link contains .mp4 twice, once in the middle and once at the end, so a pattern that stops at the first .mp4 would capture only part of it. That is, it looks like "https://vod-progressive.akamaized.net [..] .mp4~hmac= [..] .mp4", as the highlighted text in the attached image shows.

The following code finds all the URLs, but extracts each one only partially:

import re

# Read the whole file into a single string
with open("text.txt", "r") as text_file:
    text_string = text_file.read()

# Find URL-like substrings (this is the pattern that returns truncated matches)
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text_string)

print(urls)

It prints the following: ['https://vod-progressive.akamaized.net/exp=1685291596', 'https://vod-progressive.akamaized.net/exp=1685291596', 'https://vod-progressive.akamaized.net/exp=1685291596']
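
The truncation appears to come from the pattern itself: none of its character classes include `~`, so each match stops at the first `~` in the URL. A minimal sketch on a shortened, made-up URL (the `sample` string and its hostname are illustrative and not taken from the file) reproduces this and shows one possible alternative, which is to match everything up to the closing double quote:

```python
import re

# The pattern from the question, with the scheme written as http[s]?
PATTERN = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Hypothetical, shortened stand-in for one "url" value in the file
sample = '"url":"https://example.net/exp=1~acl=%2Fa.mp4~hmac=ab/a.mp4","cdn":"x"'

# "~" is not covered by any character class above, so the match ends there
truncated = re.findall(PATTERN, sample)

# Matching up to the next double quote (or whitespace) keeps the whole URL
full = re.findall(r'https?://[^"\s]+', sample)

print(truncated)
print(full)
```

This only addresses the truncation; selecting the 720p link still requires a filter on the matched URLs or on the surrounding fields.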

Modifications needed to the code:

The links are not fully obtained. Only the initial part of each one is returned.
Not every link is needed. The desired link is described under "Description of the problem".

Answer 1

Score: 1

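import json

# json_str is assumed to hold the JSON array of stream objects from the question
data = json.loads(json_str)
# data is a list of dicts, so select the entry whose "quality" is "720p"
url = next(item["url"] for item in data if item["quality"] == "720p")

The URL you want can be retrieved from the "url" key of the entry whose "quality" is "720p".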
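
The text shown in the question is a fragment rather than one complete JSON document, so calling `json.loads` on the whole file may fail. One possible workaround, sketched here under the assumption that each stream object is flat (no nested braces) and carries `url` and `quality` keys as in the sample, is to pull out each `{...}` span with a regex and parse it on its own. The helper name `find_url_by_quality` and the `sample` text are hypothetical.

```python
import json
import re

def find_url_by_quality(text, quality="720p"):
    """Return the first "url" whose object has the given "quality", else None."""
    # Assumes stream objects are flat, i.e. they contain no nested { }.
    for candidate in re.findall(r'\{[^{}]*\}', text):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue  # skip brace-delimited spans that are not valid JSON
        if obj.get("quality") == quality and "url" in obj:
            return obj["url"]
    return None

# Hypothetical, shortened stand-in for the contents of text.txt
sample = (
    '[{"profile":"164","url":"https://example.net/a~hmac=1/360p/a.mp4","quality":"360p"},'
    '{"profile":"174","url":"https://example.net/b~hmac=2/720p/b.mp4","quality":"720p"}]},'
    '"file_codecs":{"av1":[],"avc":["x"]},"lang":"en","referrer"'
)

print(find_url_by_quality(sample))
```

With the real file, the text read from text.txt would be passed in place of `sample`. Filtering on the "quality" field rather than searching the URL for "720p" avoids depending on how the path happens to be spelled.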
