python:过滤函数不会影响列表中的所有项[重复]

gigamarr • 6 年前 • 1626 次点击

我有一个字符串列表,我只想保留最独特的字符串。这是我如何实现的(也许循环有问题)。

def filter_descriptions(descriptions):
    MAX_SIMILAR_ALLOWED = 0.6  #40% unique and 60% similar
    i = 0
    while i < len(descriptions):
        print("Processing {}/{}...".format(i + 1, len(descriptions)))
        desc_to_evaluate = descriptions[i]
        j = i + 1
        while j < len(descriptions):
            similarity_ratio = SequenceMatcher(None, desc_to_evaluate, descriptions[j]).ratio()
            if similarity_ratio > MAX_SIMILAR_ALLOWED:
                del descriptions[j]
            j += 1
        i += 1
    return descriptions

请注意名单上可能有 110 K项目 这就是为什么我每次迭代都要缩短列表。

有人能指出目前的实施有什么问题吗?

编辑1:

目前的结果“太相似”。这个 filter_descriptions 函数返回16个项(从约110k个项的列表中)。当我试着做以下事情时,

SequenceMatcher(None, descriptions[0], descriptions[1]).ratio()

这个比率是0.99, SequenceMatcher(None, descriptions[1], descriptions[2]).ratio() 大约是0.98。但与 SequenceMatcher(None, descriptions[0], descriptions[15]).ratio() 大约是0.65(更好)

我希望这能有帮助。

Python社区是高质量的Python/Django开发社区
本文地址：http://www.python88.com/topic/44315

文章 [ 3 ] | 最新文章 6 年前

• 1 楼

Mujeeb 7 年前

价值 j 当从列表中删除某个项时不应更改(因为下一次迭代中将在该位置出现另一个列表项)。做 j=i+1 每次删除项目时重新启动迭代(这不是所需的)。更新后的代码现在只会增加 J 在其他条件下。

def filter_descriptions(descriptions):
    MAX_SIMILAR_ALLOWED = 0.6  #40% unique and 60% similar
    i = 0
    while i < len(descriptions):
        print("Processing {}/{}...".format(i + 1, len(descriptions)))
        desc_to_evaluate = descriptions[i]
        j = i + 1
        while j < len(descriptions):
            similarity_ratio = SequenceMatcher(None, desc_to_evaluate, descriptions[j]).ratio()
            if similarity_ratio > MAX_SIMILAR_ALLOWED:
                del descriptions[j]
            else:
                j += 1
        i += 1
    return descriptions

• 2 楼

John Doe 7 年前

逻辑上的问题是,每次从数组中删除一个项时,索引都会重新排列并跳过其中的一个字符串。如:

假设这是数组: 说明:[“A”,“A”,“A”,“B”,“C”]

第1部分:

i=0                      -------------0
description[i]="A"
j=i+1                    -------------1
description[j]="A"
similarity_ratio>0.6
del description[j]

现在数组被重新索引如下: 描述:[“A”,“A”,“B”,“C”]。下一步是:

 j=j+1                   ------------1+1= 2

说明[2]=“B”

您已跳过说明[1]=“A”

要解决此问题: 替换

j+=1

用

j=i+1

如果删除。否则进行正常的j=j+1迭代

• 3 楼

Dunes 7 年前

如果颠倒逻辑,则可以避免修改列表,同时仍然减少所需的比较次数。也就是说,从一个空的输出/唯一列表开始,遍历您的描述,看看是否可以添加每个描述。因此,对于第一个描述,您可以立即添加它,因为它不能与空列表中的任何内容相似。第二个描述只需要与第一个描述进行比较,而不是与所有其他描述进行比较。以后的迭代只要找到与之相似的先前描述,就可能短路(并丢弃候选描述)。IE.

import operator

def unique(items, compare=operator.eq):
    # compare is a function that returns True if its two arguments are deemed similar to 
    # each other and False otherwise.

    unique_items = []
    for item in items:
        if not any(compare(item, uniq) for uniq in unique_items):
            # any will stop as soon as compare(item, uniq) returns True
            # you could also use `if all(not compare(item, uniq) ...` if you prefer
            unique_items.append(item)

    return unique_items

实例:

assert unique([2,3,4,5,1,2,3,3,2,1]) == [2, 3, 4, 5, 1]
# note that order is preserved

assert unique([1, 2, 0, 3, 4, 5], compare=(lambda x, y: abs(x - y) <= 1))) == [1, 3, 5]
# using a custom comparison function we can exclude items that are too similar to previous
# items. Here 2 and 0 are excluded because they are too close to 1 which was accepted
# as unique first. Change the order of 3 and 4, and then 5 would also be excluded.

对于您的代码,比较函数将如下所示:

MAX_SIMILAR_ALLOWED = 0.6  #40% unique and 60% similar

def description_cmp(candidate_desc, unique_desc):
    # use unique_desc as first arg as this keeps the argument order the same as with your filter 
    # function where the first description is the one that is retained if the two descriptions 
    # are deemed to be too similar
    similarity_ratio = SequenceMatcher(None, unique_desc, candidate_desc).ratio()
    return similarity_ratio > MAX_SIMILAR_ALLOWED

def filter_descriptions(descriptions):
    # This would be the new definition of your filter_descriptions function
    return unique(descriptions, compare=descriptions_cmp)

比较的次数应该完全相同。也就是说,在您的实现中,第一个元素与所有其他元素进行比较,第二个元素仅与被认为与第一个元素不相似的元素进行比较,依此类推。在这个实现中,第一个项最初不会与任何内容进行比较,但是必须将所有其他项与之进行比较,才能允许将其添加到唯一列表中。只有被认为与第一个项目不相似的项目才会与第二个唯一项目进行比较,以此类推。

这个 unique 实现只需在备份阵列空间不足时复制唯一列表,因此复制操作将更少。鉴于, del 每次使用时都必须复制列表的语句部分(将所有后续项移动到其新的正确位置)。不过,这对性能的影响可能可以忽略不计,因为瓶颈可能是序列匹配器中的比率计算。

登录后回复