Extracting a particular URL from a large text file using Python [closed]

Posted by q1qsirdb on 2023-05-30 in Python


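**Closed.** This question is seeking recommendations for books, tools, software libraries, and more. It does not meet Stack Overflow guidelines. It is not currently accepting answers.

We don't allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 13 hours ago.

I am trying to extract a particular URL from a large text file.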

Data (or the text file):

[{"profile":"164","width":638,"height":360,"mime":"video/mp4","fps":30,"url":"https://vod-progressive.akamaized.net/exp=1685291596~acl=%2Fvimeo-transcode-storage-prod-us-central1-h264-360p2F3314456635875455007.mp4~hmac=938482df34b6c94756876549908053d738d426ca/vimeo-transcode-storage-prod-us-central1-h264-360p/01/2982/28/714910924/3314655007.mp4","cdn":"akamai_interconnect","quality":"360p","id":"a81ee7e1-4ae0-4b3a-85fe-7d0d8ff16b93","origin":"gcs"},{"profile":"165","width":958,"height":540,"mime":"video/mp4","fps":30,"url":"https://vod-progressive.akamaized.net/exp=1685291596~acl=%2Fvimeo-transcode-storage-prod-us-west1-h264-540p%2F01%2F2982%2F28%2F714910924%2F3314655137.mp4~hmac=938482df31237894b6c947e9908053d738d4dd26ca/vimeo-transcode-storage-prod-us-west1-h264-540p/01/2982/28/714910924/3314655137.mp4","cdn":"akamai_interconnect","quality":"540p","id":"08b0edbf-ce47-4dd1-aa17-3c14ed3eccc8","origin":"gcs"},{"profile":"174","width":1278,"height":720,"mime":"video/mp4","fps":30,"url":"https://vod-progressive.akamaized.net/exp=1685291596~acl=%2Fvimeo-transcode-storage-prod-us-east1-h264-720p%2F01%FFF2F2F22F3314655135.mp4~hmac=c44e0c048f008ce4996434ce5d6543234567189dbc63/vimeo-transcode-storage-prod-us-east1-h264-720p/01/2982/28/714910924/3314655135.mp4","cdn":"akamai_interconnect","quality":"720p","id":"625db5ed-2175-4977-a562-de40d84aab45","origin":"gcs"}]},"file_codecs":{"av1":[],"avc":["a81ee7e1-4ae0-4b3a-85fe-7d0d8ff16b93","d5589a0c-a69a-4428-ad7e-0c2a1e4e9f92","08b0edbf-ce47-4dd1-aa17-3c14ed3eccc8","625db5ed-2175-4977-a562-de40d84aab45"],"hevc":{"dvh1":[],"hdr":[],"sdr":[]}},"lang":"en","referrer"

Here's the same text that can be viewed using an online text viewer, and here is an image of the part I am trying to extract:

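Problem description:

There are several links in the text document, including links that are similar (but not identical) to the one I am trying to extract. The only link I need to extract from the txt file is the one that contains 720p.
The URL looks like "https://vod-progressive.akamaized.net [..] .mp4".
Note that the link contains a .mp4 in the middle as well as at the end, so care is needed not to end up with only a subset of the required link. That is, it looks like "https://vod-progressive.akamaized.net [..] .mp4~hmac= [..] .mp4", as you should be able to see in the highlighted text in the attached image.
The following code extracts all the URLs, but only part of each one: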

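import re

# Read the whole text file into one string
with open("text.txt", "r") as text_file:
    text_string = text_file.read()

# Note: none of the character classes below include "~", so each match ends at the first "~"
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text_string)

print(urls)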

It prints the following: ['https://vod-progressive.akamaized.net/exp=1685291596', 'https://vod-progressive.akamaized.net/exp=1685291596', 'https://vod-progressive.akamaized.net/exp=1685291596']
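
A likely cause of the truncation: `[$-_@.&+]` is a character range running from `$` to `_`, and `~` falls outside it, so each match ends just before the first `~`. The sketch below reproduces this on a made-up URL (the `example.invalid` host and the `sample` string are illustrative, not taken from the file) and compares it with a variant of the pattern that also accepts `~`. The variant is only one possible adjustment, not a definitive fix.

```python
import re

# Pattern from the question: no alternative matches "~"
ORIGINAL = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
# Same pattern with "~" added to one character class (an assumed adjustment)
WITH_TILDE = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+~]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Hypothetical text shaped like one entry in the file
sample = '"url":"https://example.invalid/exp=1~acl=%2Fa.mp4~hmac=ab/a.mp4","cdn":"x"'

truncated = re.findall(ORIGINAL, sample)   # stops before the first "~"
full = re.findall(WITH_TILDE, sample)      # runs up to the closing quote

print(truncated)
print(full)
```

With `~` allowed, the match continues until the closing double quote, which none of the classes accept.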

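What needs to change in the code:

1. The links are not obtained in full; only their initial part is returned.
2. Not every link is needed. The required link is described under "Problem description".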
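
Both points could be handled with a simpler pattern that captures whatever sits between the quotes after a `"url":` key and then keeps only the values containing `720p`. This is a minimal sketch, assuming the URLs never contain an escaped double quote. The function name `find_urls_containing` and the `sample` text are made up for illustration.

```python
import re

def find_urls_containing(text, marker="720p"):
    """Return every "url" value in text that contains marker."""
    # Capture everything between the quotes, so "~" and the inner ".mp4" are kept
    candidates = re.findall(r'"url"\s*:\s*"([^"]+)"', text)
    return [url for url in candidates if marker in url]

# Hypothetical two-entry text shaped like the data in the question
sample = (
    '[{"quality":"360p","url":"https://example.invalid/exp=1~acl=%2Fh264-360p%2Fa.mp4~hmac=ab/h264-360p/a.mp4"},'
    '{"quality":"720p","url":"https://example.invalid/exp=1~acl=%2Fh264-720p%2Fb.mp4~hmac=cd/h264-720p/b.mp4"}]'
)

matches = find_urls_containing(sample)
print(matches)
```

For the real file, `text` would be the string read from `text.txt`. Matching on the quoted value avoids having to list every character a URL may contain.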


Source: https://stackoverflow.com/questions/76353423/extracting-a-particular-url-from-a-large-text-file-using-python

1 Answer

6fe3ivhb #1

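import json

# json_str is assumed to hold valid JSON: the array of stream entries from the file
data = json.loads(json_str)
# data is a list of dicts, so pick the entry whose quality is "720p"
url = next(item["url"] for item in data if item["quality"] == "720p")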

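Once the text is parsed as JSON, the URL you want is available under the "url" key. Because the parsed data is a list of entries, select the entry whose "quality" is "720p" and read its "url", rather than indexing data["url"] directly.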
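
One caveat: the text shown in the question is a fragment (the array is followed by `},"file_codecs":...`), so `json.loads` on the whole string would likely raise an error. A possible workaround is `json.JSONDecoder().raw_decode`, which parses one JSON value starting at a given index and reports where it ended. The sketch below assumes the array of interest starts at the first `[` in the text, which may not hold for the full file. The helper name `parse_leading_array` and the `fragment` string are hypothetical.

```python
import json

def parse_leading_array(text):
    """Parse the first JSON array in text and ignore whatever follows it."""
    start = text.index("[")
    # raw_decode returns the parsed value and the index just past its end
    entries, _end = json.JSONDecoder().raw_decode(text, start)
    return entries

# Hypothetical fragment: a JSON array followed by unrelated trailing text
fragment = (
    '[{"quality":"360p","url":"https://example.invalid/a.mp4"},'
    '{"quality":"720p","url":"https://example.invalid/b.mp4"}]},"lang":"en","referrer"'
)

entries = parse_leading_array(fragment)
urls_720p = [e["url"] for e in entries if e.get("quality") == "720p"]
print(urls_720p)
```

If the array sits deeper inside a larger document, the start index would need to be located another way, for example by searching for a known key that precedes it.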

Answered 2023-05-30
