Regex to clean up Amazon links [closed]

t5fffqht posted on 2023-03-20 in Other

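Closed. This question needs to be more focused. It is not currently accepting answers.
**Want to improve this question?** Update the question so it focuses on one problem only by editing this post.

Closed 5 days ago.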
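I am trying to create a regular expression to clean up Amazon URLs, but I cannot remove the middle part.
In the attached example, I would like "group 2" to disappear from the final result. Is that possible?
I am using this regex: ^(?:http:\/\/|www\.|https:\/\/)([^\/]+)(\s?.*)(/[dg]p/)([^/]+)
I would like to get results like these: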

https://www.amazon.com/adidas-Melange-Performance-T-Shirt-Charcoal/dp/B07P4LVZNL/ref=sr_1_fkmr1_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr1 --> https://www.amazon.com/dp/B07P4LVZNL

https://www.amazon.com/adidas-Originals-Solid-Melange-Purple/dp/B07DXPN7TK/ref=sr_1_fkmr2_1?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-1-fkmr2 --> https://www.amazon.com/dp/B07DXPN7TK

https://www.amazon.es/gp/B07R23QGH6/ref=sr_1_fkmr2_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr2 --> https://www.amazon.es/gp/B07R23QGH6

https://www.amazon.it/dp/B07R23QGH6/ --> https://www.amazon.it/dp/B07R23QGH6/

https://regex101.com/r/AFGk96/1
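
For reference, here is a small sketch of what each group of the pattern above captures for the first example URL. Python is an assumption here, since the question does not name a language, and the hard-coded `https://` prefix in the rebuilt URL is also an assumption, because the pattern matches the scheme without capturing it.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'^(?:http:\/\/|www\.|https:\/\/)([^\/]+)(\s?.*)(/[dg]p/)([^/]+)')

url = ('https://www.amazon.com/adidas-Melange-Performance-T-Shirt-Charcoal'
       '/dp/B07P4LVZNL/ref=sr_1_fkmr1_2?dchild=1'
       '&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr1')

groups = pattern.search(url).groups()
for number, text in enumerate(groups, start=1):
    print(number, repr(text))

# Rebuild the URL from groups 1, 3 and 4 only, leaving out group 2.
# The scheme is not captured by the pattern, so "https://" is assumed here.
cleaned = 'https://' + groups[0] + groups[2] + groups[3]
print(cleaned)
```

Group 2 holds the product-title segment, so leaving it out of the rebuilt string is enough to drop the middle part.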

unhi4e5o

You are over-escaping. A slash has no special meaning in regular expression syntax, so there is no need to escape it:

^(?:http:\/\/|www\.|https:\/\/)([^\/]+)(\s?.*)(/[dg]p/)([^/]+)

could be written (with a few other simplifications) as

^(?:https?://)?(www[^/]+).*?(/[dg]p/[^/]+)
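
A quick check, sketched in Python, that escaped and unescaped slashes behave identically and that the simplified pattern captures the host and the product path. The URL is a shortened form of the third example from the question, and the variable names are illustrative only.

```python
import re

# '\/' and '/' match the same single character in Python's re module.
assert re.fullmatch(r'\/', '/') and re.fullmatch(r'/', '/')

# The simplified pattern, exactly as written above.
simplified = re.compile(r'^(?:https?://)?(www[^/]+).*?(/[dg]p/[^/]+)')

# Third example URL from the question, with the query string shortened.
url = 'https://www.amazon.es/gp/B07R23QGH6/ref=sr_1_fkmr2_2?dchild=1'

host, product_path = simplified.search(url).groups()
print(host)
print(product_path)
```

The lazy `.*?` skips as little as possible, so it is empty when `/gp/` or `/dp/` follows the host directly and otherwise swallows only the product-title segment.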

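Adding .* to the end of the pattern so that it matches the rest of the string, and widening group 1 to include the scheme so that it survives the substitution, gives:

import re

# Group 1: optional scheme plus host, so "https://" is kept in the result.
# Group 2: /dp/ or /gp/ followed by the product ID.
# The trailing .* consumes the rest of the URL so that sub() discards it.
amazon_url_pattern = re.compile(r'^((?:https?://)?www[^/]+).*?(/[dg]p/[^/]+).*')

url = 'https://www.amazon.com/adidas-Melange-Performance-T-Shirt-Charcoal/dp/B07P4LVZNL/ref=sr_1_fkmr1_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr1'
# Keep only the two groups and append a trailing slash.
result = amazon_url_pattern.sub(r'\1\2/', url)

print(result)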

This prints:

https://www.amazon.com/dp/B07P4LVZNL/
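
The same substitution can be wrapped in a helper and run over all four URLs from the question. This is only a sketch, not part of the original answer. The name `clean_amazon_url` is hypothetical. Group 1 is written to include the optional scheme so that `https://` is kept. The ID class is tightened to `[^/?#]+` (an assumption) so that a query string directly after the ID is dropped too. The trailing slash is omitted to match most of the expected results in the question.

```python
import re

# Hypothetical helper; the pattern is a variation on the one in the answer.
# Group 1: optional scheme plus host. Group 2: /dp/ or /gp/ plus the product ID.
AMAZON_URL = re.compile(r'^((?:https?://)?www[^/]+).*?(/[dg]p/[^/?#]+).*')

def clean_amazon_url(url: str) -> str:
    """Return scheme + host + product path; non-matching input is returned unchanged."""
    return AMAZON_URL.sub(r'\1\2', url)

# The four example URLs from the question.
urls = [
    'https://www.amazon.com/adidas-Melange-Performance-T-Shirt-Charcoal/dp/B07P4LVZNL/ref=sr_1_fkmr1_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr1',
    'https://www.amazon.com/adidas-Originals-Solid-Melange-Purple/dp/B07DXPN7TK/ref=sr_1_fkmr2_1?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-1-fkmr2',
    'https://www.amazon.es/gp/B07R23QGH6/ref=sr_1_fkmr2_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr2',
    'https://www.amazon.it/dp/B07R23QGH6/',
]

cleaned = [clean_amazon_url(u) for u in urls]
for line in cleaned:
    print(line)
```

Because `re.sub` returns the input untouched when nothing matches, URLs without a `www` host or without a `/dp/` or `/gp/` segment pass through unchanged; whether that is the desired behavior depends on the data being cleaned.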
