regex Extracting specific URLs from a large text file using Python [closed]

q1qsirdb  posted on 2023-05-30  in Python

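**Closed.** This question is seeking recommendations for books, tools, software libraries, and more. It does not meet Stack Overflow guidelines. It is not currently accepting answers.

We don't allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 13 hours ago.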
I am trying to extract a specific URL from a large text file.

Data (the text file):

[{"profile":"164","width":638,"height":360,"mime":"video/mp4","fps":30,"url":"https://vod-progressive.akamaized.net/exp=1685291596~acl=%2Fvimeo-transcode-storage-prod-us-central1-h264-360p2F3314456635875455007.mp4~hmac=938482df34b6c94756876549908053d738d426ca/vimeo-transcode-storage-prod-us-central1-h264-360p/01/2982/28/714910924/3314655007.mp4","cdn":"akamai_interconnect","quality":"360p","id":"a81ee7e1-4ae0-4b3a-85fe-7d0d8ff16b93","origin":"gcs"},{"profile":"165","width":958,"height":540,"mime":"video/mp4","fps":30,"url":"https://vod-progressive.akamaized.net/exp=1685291596~acl=%2Fvimeo-transcode-storage-prod-us-west1-h264-540p%2F01%2F2982%2F28%2F714910924%2F3314655137.mp4~hmac=938482df31237894b6c947e9908053d738d4dd26ca/vimeo-transcode-storage-prod-us-west1-h264-540p/01/2982/28/714910924/3314655137.mp4","cdn":"akamai_interconnect","quality":"540p","id":"08b0edbf-ce47-4dd1-aa17-3c14ed3eccc8","origin":"gcs"},{"profile":"174","width":1278,"height":720,"mime":"video/mp4","fps":30,"url":"https://vod-progressive.akamaized.net/exp=1685291596~acl=%2Fvimeo-transcode-storage-prod-us-east1-h264-720p%2F01%FFF2F2F22F3314655135.mp4~hmac=c44e0c048f008ce4996434ce5d6543234567189dbc63/vimeo-transcode-storage-prod-us-east1-h264-720p/01/2982/28/714910924/3314655135.mp4","cdn":"akamai_interconnect","quality":"720p","id":"625db5ed-2175-4977-a562-de40d84aab45","origin":"gcs"}]},"file_codecs":{"av1":[],"avc":["a81ee7e1-4ae0-4b3a-85fe-7d0d8ff16b93","d5589a0c-a69a-4428-ad7e-0c2a1e4e9f92","08b0edbf-ce47-4dd1-aa17-3c14ed3eccc8","625db5ed-2175-4977-a562-de40d84aab45"],"hevc":{"dvh1":[],"hdr":[],"sdr":[]}},"lang":"en","referrer"

Here's the same text, which can be viewed using an online text viewer, and this is an image of the part I am trying to extract:

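Problem description:

  • The text file contains several links, including some that are similar (but not identical) to the one I am trying to extract. The only link I need to extract from the txt file is the one containing 720p.
  • The URL looks like "https://vod-progressive.akamaized.net [..] .mp4".
  • Note that the link contains one .mp4 in the middle and another at the end, so take care not to end up with only a subset of the required link. That is, it looks like "https://vod-progressive.akamaized.net [..] .mp4~hmac= [..] .mp4", as you should be able to see in the highlighted text in the attached image.

The following code finds all the URLs, but extracts only part of each one: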
import re

# Read the whole file into memory
with open("text.txt", "r") as text_file:
    text_string = text_file.read()

# Find every http(s) URL in the text
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text_string)

print(urls)

It prints the following: ['https://vod-progressive.akamaized.net/exp=1685291596', 'https://vod-progressive.akamaized.net/exp=1685291596', 'https://vod-progressive.akamaized.net/exp=1685291596']

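What needs to change in the code:

1. The links are not captured in full; only the initial part of each one is obtained.
2. Not every link is needed. The required link is described under "Problem description".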
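
The truncation in point 1 can be traced to the character classes in the pattern: none of them contains `~`, so each match ends just before the first `~` in the URL. Below is a minimal sketch of one possible fix. It assumes a shortened, made-up sample in the same shape as the data above, and it assumes that every wanted URL sits inside double quotes and ends in `.mp4`. The names `sample`, `full_urls` and `urls_720p` are illustrative only.

```python
import re

# Shortened, hypothetical sample shaped like the data in the question.
sample = (
    '{"url":"https://vod-progressive.akamaized.net/exp=1~acl=%2Fa-540p%2F1.mp4'
    '~hmac=ab/a-540p/1.mp4","quality":"540p"},'
    '{"url":"https://vod-progressive.akamaized.net/exp=1~acl=%2Fa-720p%2F2.mp4'
    '~hmac=cd/a-720p/2.mp4","quality":"720p"}'
)

# Original pattern: no class contains "~", so each match stops right before it.
original = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
truncated = re.findall(original, sample)

# Alternative: consume everything that is not a double quote or whitespace,
# then require the match to end in ".mp4". The greedy "+" backtracks to the
# last ".mp4" before the closing quote, so the inner ".mp4" does not cut it short.
full_urls = re.findall(r'https://vod-progressive\.akamaized\.net/[^"\s]+\.mp4', sample)

# Keep only the links that mention 720p.
urls_720p = [u for u in full_urls if "720p" in u]

print(truncated)
print(urls_720p)
```

Filtering with `"720p" in u` after matching keeps the pattern simple; it relies on the assumption that the quality label appears somewhere in the URL path, as it does in the data shown above.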

6fe3ivhb #1

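import json

# json_str is assumed to hold just the JSON array of stream objects from the question
data = json.loads(json_str)
# data is a list of dicts, so pick the entry whose quality is 720p
urls = [item["url"] for item in data if item["quality"] == "720p"]

The URL you want can be retrieved from the "url" key of the entry whose "quality" is "720p".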
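
The excerpt in the question continues past the closing `]` of the array (with `},"file_codecs":...`), so calling `json.loads` on the whole text would likely fail with an "Extra data" error. One way around that is `json.JSONDecoder.raw_decode`, which parses a single JSON value starting at a given index and ignores what follows. The sketch below assumes the array begins at the first occurrence of `[{"profile"`. The helper name `urls_with_quality`, the `marker` parameter and the `example.invalid` URLs are hypothetical.

```python
import json


def urls_with_quality(text, quality="720p", marker='[{"profile"'):
    """Decode the JSON array that starts at `marker` and return matching URLs."""
    start = text.find(marker)
    if start == -1:
        return []
    # raw_decode returns (parsed_value, end_index); trailing text is ignored.
    streams, _end = json.JSONDecoder().raw_decode(text, start)
    return [s["url"] for s in streams if s.get("quality") == quality]


# Shortened, hypothetical sample shaped like the excerpt in the question,
# including the trailing keys that follow the array.
sample = (
    '[{"profile":"165","url":"https://example.invalid/a-540p/1.mp4","quality":"540p"},'
    '{"profile":"174","url":"https://example.invalid/a-720p/2.mp4","quality":"720p"}]'
    '},"file_codecs":{"av1":[]},"lang":"en","referrer"'
)

print(urls_with_quality(sample))
```

Compared with a regex, this selects on the `"quality"` field itself rather than on a substring of the URL, at the cost of requiring the array portion of the file to be valid JSON.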
