Python 3.6: how do I extract a URL from a .txt file with a regex?

passwordhash • 5 years ago • 1665 views

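I need to get a URL from a text file.

The URL is stored in a string like this: 'URL=http://example.net'.

Is there any way I can extract everything after the = character, up to the . in '.net'?

Can I use the re module for this?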

Article URL: http://www.python88.com/topic/40773


Replies [ 5 ] | Latest reply 5 years ago

• Reply #1

Abhishek 5 years ago

Please try this. It works for me.

import re
s='url=http://example.net'
print(re.findall(r"=(.*)\.",s)[0])
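
One thing worth knowing about the pattern above: `.*` is greedy, so the capture runs up to the last dot in the string, not the first. A minimal sketch of that behavior follows; the second input (with a `www.` prefix) is a hypothetical example, not something from the original question.

```python
import re

# Only the first form appears in the question; the second is a hypothetical variant.
simple = 'url=http://example.net'
with_www = 'url=http://www.example.net'

pattern = r"=(.*)\."  # greedy: the group extends up to the LAST dot

simple_result = re.findall(pattern, simple)[0]
www_result = re.findall(pattern, with_www)[0]

print(simple_result)  # http://example
print(www_result)     # http://www.example
```

If the lines in your file can contain more than one dot, decide up front whether you want the text before the first dot or before the last one.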

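• Reply #2

Marius Mucenicu 5 years ago

You don't need regular expressions (re) for such a simple task.

If the string has the form: 'URL=http://example.net'

then you can solve this with basic Python in several ways, one of which is: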


file_line = 'URL=http://example.net'
start_position = file_line.find('=') + 1  # this gives you the first position after =
end_position = file_line.find('.')

# this extracts from the start_position up to but not including end_position
url = file_line[start_position:end_position]

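Of course, this only extracts a single URL. If you are working with a large text and want to extract every URL in it, you will want to put this logic into a function so that you can reuse it, and then build around it (with a while or for loop and, depending on how you iterate, by keeping track of the position of the last extracted URL, and so on).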
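
The idea of wrapping the slicing logic in a function and looping over lines could look roughly like the sketch below. The function name `extract_url_part` and the sample lines are assumptions made for illustration; lines without an = or a following . are simply skipped.

```python
def extract_url_part(file_line):
    """Return the text between the first '=' and the first '.' after it, or None."""
    eq_position = file_line.find('=')
    if eq_position == -1:
        return None  # no '=' on this line
    start_position = eq_position + 1
    end_position = file_line.find('.', start_position)
    if end_position == -1:
        return None  # no '.' after the '='
    return file_line[start_position:end_position]


# Hypothetical sample lines standing in for the contents of a text file.
lines = [
    'URL=http://example.net',
    'some unrelated line',
    'URL=http://another-example.org',
]

results = [part for part in map(extract_url_part, lines) if part is not None]
print(results)  # ['http://example', 'http://another-example']
```

With a real file you would iterate over the open file object (for example `for line in f:`) instead of a hard-coded list.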

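A word of advice

This question has already been answered many times on this forum, by very skilled people, for example here, here, here and here, in more detail than you might expect. And those are not all of them; they are just the first few that came up in my search results.

Considering that (at the time this question was posted) you are a new contributor to this site, my friendly advice is to invest some effort in searching for answers like these. It is a crucial skill, and you cannot do without it in the programming world.

Remember, whatever problem you run into, chances are that someone on this forum has already run into it and received an answer. You just need to find it.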

• Reply #3

Always Sunny 5 years ago

One way is to use a regex lookbehind and lookahead to match what comes after the = and before the .

import re

# (?<==) requires a preceding "=", (?=\.) requires a following "."
regex = r"(?<==)(.*)(?=\.)"

test_str = "\"URL=http://example.net\""

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # Capture groups are numbered from 1; group 0 is the whole match
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
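
When only one value per string is needed, the same lookaround pattern can be used more compactly with `re.search`. This is a minimal sketch under that assumption; the helper name `between_eq_and_dot` is purely illustrative.

```python
import re

# Same lookbehind/lookahead pattern as above, compiled once for reuse.
LOOKAROUND = re.compile(r"(?<==)(.*)(?=\.)")


def between_eq_and_dot(s):
    """Return the text after '=' and before the last '.', or None if there is no match."""
    m = LOOKAROUND.search(s)
    return m.group(1) if m else None


print(between_eq_and_dot('URL=http://example.net'))  # http://example
print(between_eq_and_dot('no equals sign here'))     # None
```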

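• Reply #4

gerard Garvey 5 years ago

I don't have much information to go on, but I'll do my best to help. I'm assuming URL= is part of the string, in which case you can do this: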

# the final dot is escaped (\.) so that it matches a literal "." rather than any character
re.findall(r'URL=(.*?)\.', STRINGNAMEHERE)

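Let me go into a bit more detail on (.*?). The dot means any character (except a newline), the asterisk means zero or more occurrences, and the ? is hard to explain, but here is an example from the docs: "Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'." Note that when ? directly follows another quantifier, as in .*?, it instead makes that quantifier non-greedy, so it matches as few characters as possible. The parentheses put it all into a group. All of this basically means that it will find everything between URL= and the first .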
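
To make the difference between greedy and non-greedy matching concrete, here is a minimal sketch. The input string is a hypothetical example with more than one dot, and the final dot in each pattern is escaped as `\.` so that it matches a literal period.

```python
import re

# Hypothetical input with more than one dot after "URL=".
s = 'URL=http://www.example.net'

greedy = re.findall(r'URL=(.*)\.', s)  # .* takes as much as possible
lazy = re.findall(r'URL=(.*?)\.', s)   # .*? takes as little as possible

print(greedy)  # ['http://www.example']
print(lazy)    # ['http://www']
```

For the string in the question, 'URL=http://example.net', both patterns return the same result because there is only one dot.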

• Reply #5

Marius Mucenicu 5 years ago

import re

text = """A key feature of effective analytics infrastructure in healthcare is a metadata-driven architecture. In this article, three best practice scenarios are discussed: https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare Automating ETL processes so data analysts have more time to listen and help end users , https://www.google.com/, https://www.facebook.com/, https://twitter.com
code below catches all urls in text and returns urls in list."""

# raw string so the backslashes reach the regex engine unchanged
urls = re.findall(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)

print(urls)

Output:

[ 
   'https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare',
   'https://www.google.com/',
   'https://www.facebook.com/',
   'https://twitter.com'
]
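
Since the original question is about a text file, here is a minimal sketch of applying the same pattern to a file's contents. The helper name `extract_urls_from_file` and the temporary demo file are assumptions made for illustration; the regex is the one used in this reply.

```python
import os
import re
import tempfile

# Same pattern as in the reply above, compiled once for reuse.
URL_PATTERN = re.compile(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+')


def extract_urls_from_file(path):
    """Read the whole file and return every substring matching URL_PATTERN."""
    with open(path, encoding='utf-8') as f:
        return URL_PATTERN.findall(f.read())


# Demo: write a small temporary .txt file, then extract from it.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write('URL=http://example.net\nsee https://www.google.com/ too\n')
    tmp_path = tmp.name

found = extract_urls_from_file(tmp_path)
os.remove(tmp_path)  # clean up the demo file

print(found)  # ['http://example.net', 'https://www.google.com/']
```

Keep in mind that the scheme is optional in this pattern, so it also matches other dotted tokens such as file names or decimal numbers; tighten it if your text contains those.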
