Regular expression for URLs in Python

ching-yu • 5 years ago • 1699 views

I want to remove all URLs from a sentence.
Here is my code:

import re

import ijson

# Stream the items of the top-level JSON array from the crawled PTT file.
f = open("/content/drive/My Drive/PTT 爬蟲/content/MakeUp/PTT_MakeUp_content_0_1000.json")
objects = ijson.items(f, 'item')

for obj in list(objects):
    article = obj['content']
    ret = re.findall("http[s*]:[a-zA-Z0-9_.+-/#~]+ ", article) # question here
    for r in ret:
        article = article.replace(r, "")
    print(article)

But URLs that start with "http" (rather than "https") are still left in the sentence.

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"

Any ideas? Thanks for your help.


3 replies | latest reply 5 years ago

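• Reply #1

gilch 5 years ago

Change [s*] to s? . The former is a character set containing two characters, so it matches exactly one character that is either s or *. The latter is an optional s. There are sites like regex101.com that let you try out regular expressions in the Python dialect, and they explain what each part of the regex does.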
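To make the difference concrete, here is a small sketch. The sample text and the use of \S+ for the rest of the URL are just assumptions for the demo, not the asker's data.

```python
import re

text = "a http://i.imgur.com/x.jpg b https://example.com/y c"

# [s*] is a character class: exactly one character, either "s" or "*".
# "http:" has neither after "http", so plain http URLs are skipped.
broken = re.findall(r"http[s*]:\S+", text)

# s? means "an optional s", so both http and https match.
fixed = re.findall(r"https?:\S+", text)

print(broken)  # ['https://example.com/y']
print(fixed)   # ['http://i.imgur.com/x.jpg', 'https://example.com/y']
```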

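• Reply #2

Allan The fourth bird 5 years ago

The URL starts with http, and in your pattern the part [s*] is a character class that matches either an s or a *, so it requires one of those two characters right after http.

I think you are looking for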

https?:[a-zA-Z0-9_.+-/#~]+

Regex demo | Python demo

import re
regex = r"https?:[a-zA-Z0-9_.+-/#~]+ "
article = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
result = re.sub(regex, "", article)
print(result)

Result

眼影盤長這樣 說真的 很不好拍
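A side note on the character class in this pattern, as a rough sketch: inside [...] the sequence +-/ is read as a range from "+" to "/", which happens to cover "+", ",", "-", "." and "/". The rearranged class in `strict` below is my own variation (hyphen moved to the end so it is literal), not something from the answer.

```python
import re

# Inside a class, "+-/" is a range from "+" (0x2B) to "/" (0x2F).
in_range = re.findall(r"[+-/]", "+,-./")
print(in_range)  # ['+', ',', '-', '.', '/']

sample = "see http://i.imgur.com/uxvRo3h.jpg, ok"

# Original class: the range also accepts ",", so the comma is swallowed.
loose_matches = re.findall(r"https?:[a-zA-Z0-9_.+-/#~]+", sample)

# Hyphen placed last is a literal "-", so "," ends the match.
strict_matches = re.findall(r"https?:[a-zA-Z0-9_.+/#~-]+", sample)

print(loose_matches)   # ['http://i.imgur.com/uxvRo3h.jpg,']
print(strict_matches)  # ['http://i.imgur.com/uxvRo3h.jpg']
```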

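A slightly shorter expression with a somewhat broader match would be to match a non-whitespace character 1+ times with \S+ , followed by a space 0+ times to match the trailing space as in your original pattern.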

\bhttps?:\S+ *

Regex demo
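For illustration, this is roughly how the shorter pattern behaves with re.sub. The sample text is made up for the demo.

```python
import re

pattern = re.compile(r"\bhttps?:\S+ *")

text = "before http://i.imgur.com/uxvRo3h.jpg   after https://example.com/x"

# Each URL is removed together with any spaces that directly follow it.
cleaned = pattern.sub("", text)
print(repr(cleaned))  # 'before after '

# \b requires a word boundary before "http", so "xhttp://a" is not matched.
glued = pattern.search("xhttp://a")
print(glued)  # None
```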

• Reply #3

Tim Biegeleisen 5 years ago

A simple fix is to replace the pattern https?://\S+ with an empty string:

import re

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
output = re.sub(r'https?://\S+', '', article_example)
print(output)

This prints:

眼影盤長這樣  說真的 很不好拍

My pattern assumes that whatever non-whitespace characters follow http:// or https:// are part of the URL.
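A minimal sketch of what that assumption means in practice. The helper name `strip_urls` and the extra space-collapsing step are my additions for the demo, not part of the answer.

```python
import re

def strip_urls(text):
    """Remove http(s) URLs, then collapse the leftover runs of spaces."""
    without_urls = re.sub(r"https?://\S+", "", text)
    return re.sub(r" {2,}", " ", without_urls).strip()

print(strip_urls("look http://i.imgur.com/uxvRo3h.jpg here"))  # look here

# Trailing punctuation is non-whitespace, so it is removed with the URL.
print(strip_urls("(see http://example.com/a.jpg) ok"))  # (see ok
```

If punctuation glued to the end of a URL matters for your text, the pattern would need a stricter character set than \S.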
