How do I get all unique HTML tags on a web page using regular expressions in Python?

justanothercoder • 5 years ago • 2096 views

I am new to Python and to scraping web pages. I have the HTML source of a page:

import requests
text = requests.get("https://en.wikipedia.org/wiki/Collatz_conjecture").text

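What I want to do is count the number of unique HTML tags on this page. Closing tags should not be counted separately, and each tag should be counted only once.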

Yes, I know that using an HTML parser such as BeautifulSoup would be much easier, but I would like to do this using only regular expressions.

I have already counted them by brute force, and the answer is roughly 60 unique tags. How should I go about this?

I have already tried re.findall(), but it did not work.

'''

Website link: https://en.wikipedia.org/wiki/Collatz_conjecture

'''

Since the answer is around 60, I expect the output to be

"Number of unique HTML tags: 60"

Article URL: http://www.python88.com/topic/50698

Replies [ 2 ] | Latest reply 5 years ago

• Reply #1

akin_ai 5 years ago

Please don't parse HTML with regex! Use a module like bs4 instead. But if you insist on doing it this way:

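import re
import requests

url = 'https://en.wikipedia.org/wiki/Collatz_conjecture'
text = requests.get(url).text

# Grab everything that looks like a tag: '<', any run of non-'>' characters, '>'
tags = re.findall(r'<[^>]*>', text)

# Keep only the leading '<name' part of each tag and close it with '>'
total = []
for tag in tags:
    m = re.match(r'<[^\s>]+', tag)
    if m:  # skip degenerate matches such as '<>' that have no name
        total.append(m.group() + '>')

# Closing tags (e.g. '</div>') should not be counted
closing = re.compile(r'</[^<]')
unwanted = list(filter(closing.match, total))

# Comments, conditional comments and the doctype are not tags either
unwanted.extend(['<!-->', '<!--[if>', '<!DOCTYPE>', '<![endif]-->'])

final = set(total) - set(unwanted)

print('Number of unique HTML tags:', len(final))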
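The filtering steps above can probably be folded into a single pattern. The following is a minimal sketch, not the reply author's method: it assumes a tag name starts with a letter right after '<', which skips closing tags, comments and the doctype without a separate blacklist. The names TAG_NAME, unique_tags_regex, TagCollector, unique_tags_parser and the sample markup are all made up for illustration, and the standard-library html.parser is used only as a cross-check on the regex result.

```python
import re
from html.parser import HTMLParser

# Opening tags only: '<' directly followed by a letter, so closing tags
# ('</p>'), comments ('<!-- -->') and the doctype never match.
TAG_NAME = re.compile(r'<([A-Za-z][A-Za-z0-9-]*)')


def unique_tags_regex(html):
    """Return the set of lower-cased tag names found by the regex."""
    return {name.lower() for name in TAG_NAME.findall(html)}


class TagCollector(HTMLParser):
    """Collects tag names with the standard-library parser as a cross-check."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lower-cases tag names
        self.names.add(tag)


def unique_tags_parser(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.names


# Hypothetical sample markup, used here instead of the live Wikipedia page
sample = """<!DOCTYPE html>
<html><head><title>Demo</title></head>
<body>
<!-- a comment -->
<p class="x">Hello <b>world</b><br/></p>
<P>Second paragraph</P>
</body></html>"""

tags = unique_tags_regex(sample)
print('Number of unique HTML tags:', len(tags))
```

To run this against the real page, pass requests.get(url).text to unique_tags_regex instead of sample. On real pages the regex can still over-count, for example when a '<' followed by a letter appears inside a script or a comment, which is one reason the two functions may disagree there.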

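• Reply #2

Philip 5 years ago

The following extracts 63 unique URLs from the page in question. Note that it counts URLs rather than HTML tags, even though the printed message says "HTML tags":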

import requests
import re

url = "https://en.wikipedia.org/wiki/Collatz_conjecture"
text = requests.get(url).text

url_pattern = r"((http(s)?://)([\w-]+\.)+[\w-]+[.com]+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?)"

# Get all matching patterns of url_pattern
# this will return a list of tuples 
# where we are only interested in the first item of the tuple
urls = re.findall(url_pattern, text)

# using list comprehension to get the first item of the tuple, 
# and the set function to filter out duplicates
unique_urls = set([x[0] for x in urls])
print(f'Number of unique HTML tags: {len(unique_urls)} found on {url}')

Output:

Number of unique HTML tags: 63 found on https://en.wikipedia.org/wiki/Collatz_conjecture
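As the comments in the code above point out, re.findall() returns tuples once a pattern contains capturing groups. A small sketch of one way around that, assuming a simplified URL pattern that is not the one from this reply: with only non-capturing groups, findall() returns whole matches as plain strings. URL_PATTERN and the sample text are hypothetical.

```python
import re

# Simplified, hypothetical URL pattern (not the one from the reply above).
# It uses only non-capturing groups, so findall() returns plain strings.
URL_PATTERN = re.compile(r"https?://(?:[\w-]+\.)+[\w-]+(?:/[\w\-.,@?^=%&:/~+#]*)?")

# Hypothetical sample text standing in for the downloaded page
sample = (
    '<a href="https://example.com/a">one</a> '
    '<a href="https://example.com/a">again</a> '
    '<a href="http://docs.example.org/x?y=1">two</a>'
)

# A set drops the duplicate link; no tuple unpacking is needed
unique_urls = set(URL_PATTERN.findall(sample))
print(f"Number of unique URLs: {len(unique_urls)}")
```

The same effect is available with re.finditer() and match.group(0) when the capturing groups have to stay in the pattern.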
