Philip

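Using your two examples, I was able to build a regex with Python's non-greedy syntax, as described here. It splits the two examples as follows: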

1:[123]   2:[foo]   3:[456]
1:[2]   2:[foo1c#BAR]   3:[]

Here is the regular expression:

^([^A-Za-z]*)(.*?)([^A-Za-z]*)$

mo.group(2) is what you want, where mo is the match object.
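As a minimal sketch of how the three groups behave, the pattern can be run against two input strings. The inputs `123foo456` and `2foo1c#BAR` are assumptions inferred from the output shown above (the original examples are not quoted in this answer), and `split_parts` is a hypothetical helper name.

```python
import re

# Group 1: leading non-letters (greedy).
# Group 2: the middle part (non-greedy, so it gives back trailing non-letters).
# Group 3: trailing non-letters (greedy).
pattern = re.compile(r"^([^A-Za-z]*)(.*?)([^A-Za-z]*)$")


def split_parts(s):
    """Return (leading, middle, trailing) for a single-line string."""
    mo = pattern.match(s)
    if mo is None:
        # Can happen for multi-line input, since '.' does not match a newline.
        raise ValueError(f"no match for {s!r}")
    return mo.group(1), mo.group(2), mo.group(3)


# Assumed inputs, reconstructed from the output shown in the answer.
for s in ["123foo456", "2foo1c#BAR"]:
    lead, middle, trail = split_parts(s)
    print(f"1:[{lead}]   2:[{middle}]   3:[{trail}]")
```

The non-greedy `.*?` matters here: with a greedy `.*`, group 2 would swallow the trailing digits and group 3 would always be empty.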

The following extracts the URLs from the page in question; it found 63 when I ran it:

import requests
import re

url = "https://en.wikipedia.org/wiki/Collatz_conjecture"
text = requests.get(url).text

url_pattern = r"((http(s)?://)([\w-]+\.)+[\w-]+[.com]+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)"

# Get all matching patterns of url_pattern
# this will return a list of tuples 
# where we are only interested in the first item of the tuple
urls = re.findall(url_pattern, text)

# using list comprehension to get the first item of the tuple, 
# and the set function to filter out duplicates
unique_urls = set([x[0] for x in urls])
print(f'Number of unique URLs: {len(unique_urls)} found on {url}')

Output:

Number of unique URLs: 63 found on https://en.wikipedia.org/wiki/Collatz_conjecture
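To see why `re.findall` returns tuples for this pattern and what `x[0]` selects, the same `url_pattern` can be tried on a short inline string with no network request. This is only a sketch: the sample text and the names `sample`, `matches`, and `unique` are made up for illustration.

```python
import re

# Same pattern as in the answer above.
url_pattern = r"((http(s)?://)([\w-]+\.)+[\w-]+[.com]+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)"

# Made-up sample text containing one duplicated URL.
sample = "see https://example.com/a and https://example.com/a and http://docs.python.org/3/"

# The pattern has five capture groups, so findall returns one
# 5-tuple per match; the outermost group (index 0) is the full URL.
matches = re.findall(url_pattern, sample)

# A set comprehension keeps the full URL and drops duplicates.
unique = {m[0] for m in matches}

for u in sorted(unique):
    print(u)
```

If only the full URL is needed, turning the inner groups into non-capturing ones (`(?:...)`) would make `findall` return plain strings, at the cost of changing the pattern from the one posted.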