如果颠倒逻辑,则可以避免修改列表,同时仍然减少所需的比较次数。也就是说,从一个空的输出/唯一列表开始,遍历您的描述,看看是否可以添加每个描述。因此,对于第一个描述,您可以立即添加它,因为它不能与空列表中的任何内容相似。第二个描述只需要与第一个描述进行比较,而不是与所有其他描述进行比较。以后的迭代只要找到与之相似的先前描述,就可能短路(并丢弃候选描述)。IE.
import operator
def unique(items, compare=operator.eq):
# compare is a function that returns True if its two arguments are deemed similar to
# each other and False otherwise.
unique_items = []
for item in items:
if not any(compare(item, uniq) for uniq in unique_items):
# any will stop as soon as compare(item, uniq) returns True
# you could also use `if all(not compare(item, uniq) ...` if you prefer
unique_items.append(item)
return unique_items
实例:
assert unique([2,3,4,5,1,2,3,3,2,1]) == [2, 3, 4, 5, 1]
# note that order is preserved
assert unique([1, 2, 0, 3, 4, 5], compare=(lambda x, y: abs(x - y) <= 1))) == [1, 3, 5]
# using a custom comparison function we can exclude items that are too similar to previous
# items. Here 2 and 0 are excluded because they are too close to 1 which was accepted
# as unique first. Change the order of 3 and 4, and then 5 would also be excluded.
对于您的代码,比较函数将如下所示:
MAX_SIMILAR_ALLOWED = 0.6 #40% unique and 60% similar
def description_cmp(candidate_desc, unique_desc):
# use unique_desc as first arg as this keeps the argument order the same as with your filter
# function where the first description is the one that is retained if the two descriptions
# are deemed to be too similar
similarity_ratio = SequenceMatcher(None, unique_desc, candidate_desc).ratio()
return similarity_ratio > MAX_SIMILAR_ALLOWED
def filter_descriptions(descriptions):
# This would be the new definition of your filter_descriptions function
return unique(descriptions, compare=descriptions_cmp)
比较的次数应该完全相同。也就是说,在您的实现中,第一个元素与所有其他元素进行比较,第二个元素仅与被认为与第一个元素不相似的元素进行比较,依此类推。在这个实现中,第一个项最初不会与任何内容进行比较,但是必须将所有其他项与之进行比较,才能允许将其添加到唯一列表中。只有被认为与第一个项目不相似的项目才会与第二个唯一项目进行比较,以此类推。
这个
unique
实现只需在备份阵列空间不足时复制唯一列表,因此复制操作将更少。鉴于,
del
每次使用时都必须复制列表的语句部分(将所有后续项移动到其新的正确位置)。不过,这对性能的影响可能可以忽略不计,因为瓶颈可能是序列匹配器中的比率计算。