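Obviously there are lots of different ways of getting the same results; your question seems to be what is an efficient way of getting the last result in each group in MySQL. If you are working with huge amounts of data, and assuming you are using InnoDB with even the latest versions of MySQL (such as 5.7.21 and 8.0.4-rc), then there might not be an efficient way of doing this.
We sometimes need to do this with tables that have more than 60 million rows.
For these examples I will use data with only about 1.5 million rows, where the queries need to find results for all groups in the data. In our actual cases we would often need to return data from about 2,000 groups (which hypothetically would not require examining very much of the data).
I will use the following tables: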
CREATE TABLE temperature(
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  groupID INT UNSIGNED NOT NULL,
  recordedTimestamp TIMESTAMP NOT NULL,
  recordedValue INT NOT NULL,
  INDEX groupIndex(groupID, recordedTimestamp),
  PRIMARY KEY (id)
);
CREATE TEMPORARY TABLE selected_group(id INT UNSIGNED NOT NULL, PRIMARY KEY(id));
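The temperature table is populated with about 1.5 million random records, spread over 100 different groups.
The selected_group table is populated with those 100 groups (in our real cases this would normally be less than 20% of all groups).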
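How the tables were filled is not shown here. As a rough sketch only, comparable data could be generated on MySQL 8.0 with a recursive CTE (not available in 5.7). The value ranges, the start time, and the 30-day window below are assumptions made for illustration, not the data behind the timings.

```sql
-- Allow the recursive CTE below to produce 1.5 million rows
-- (the default recursion limit is 1000).
SET SESSION cte_max_recursion_depth = 1500000;

INSERT INTO temperature (groupID, recordedTimestamp, recordedValue)
WITH RECURSIVE seq (n) AS (
  SELECT 1
  UNION ALL
  SELECT n + 1 FROM seq WHERE n < 1500000
)
SELECT
  FLOOR(1 + RAND() * 100),                                 -- groupID in 1..100
  FROM_UNIXTIME(1500000000 + FLOOR(RAND() * 86400 * 30)),  -- random second within 30 days
  FLOOR(RAND() * 100)                                      -- arbitrary reading
FROM seq;

-- Select every group that actually occurs in the data.
INSERT INTO selected_group (id)
SELECT DISTINCT groupID FROM temperature;
```

With roughly 15,000 rows per group spread over about 2.6 million possible seconds, some rows in the same group will share a timestamp, which is the situation the tie-break on id is meant to handle.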
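As this data is random, multiple rows can have the same recordedTimestamp. What we want is a list of all the selected groups in order of groupID, with the last recordedTimestamp for each group, and if the same group has more than one matching row like that, then the one with the highest id among those rows.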
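To make the requirement concrete, here is a tiny hand-made data set together with the rows that should come back. The table names temperature_demo and selected_group_demo and all of the values are hypothetical, invented for illustration; they are not part of the test data used for the timings.

```sql
-- Hypothetical scratch tables, so the real test data is left untouched.
CREATE TABLE temperature_demo LIKE temperature;
CREATE TEMPORARY TABLE selected_group_demo(id INT UNSIGNED NOT NULL, PRIMARY KEY(id));

INSERT INTO temperature_demo (id, groupID, recordedTimestamp, recordedValue) VALUES
  (1, 1, '2018-01-01 10:00:00', 10),
  (2, 1, '2018-01-02 10:00:00', 11),
  (3, 1, '2018-01-02 10:00:00', 12),  -- same timestamp as id 2, higher id
  (4, 2, '2018-01-03 10:00:00', 20),
  (5, 2, '2018-01-01 10:00:00', 21),  -- higher id than 4 but older timestamp
  (6, 3, '2018-01-04 10:00:00', 30);  -- group 3 is not selected

INSERT INTO selected_group_demo (id) VALUES (1), (2);

-- Rows a correct "last row per selected group" query should return,
-- in groupID order:
--   id 3  (group 1: latest timestamp, tie broken by the highest id)
--   id 4  (group 2: latest timestamp wins even though id 5 is higher)
```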
If, hypothetically, MySQL had a last() function which returned values from the last row according to a special ORDER BY clause, then we could simply do:
SELECT
  last(t1.id) AS id,
  t1.groupID,
  last(t1.recordedTimestamp) AS recordedTimestamp,
  last(t1.recordedValue) AS recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.groupID = g.id
ORDER BY t1.recordedTimestamp, t1.id
GROUP BY t1.groupID;
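In this case it would only need to examine a few hundred rows, as it does not use any of the normal GROUP BY functions. It would execute in 0 seconds and hence be highly efficient.
Note that normally in MySQL the ORDER BY clause follows the GROUP BY clause. Here, however, the ORDER BY clause determines the order for the last() function; if it came after the GROUP BY, it would be ordering the groups instead. If no GROUP BY clause is present, the last values will be the same in all of the returned rows.
However, MySQL does not have this function, so let's look at different ideas based on what it does have, and show that none of them are efficient.
Example 1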
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.id = (
  SELECT t2.id
  FROM temperature t2
  WHERE t2.groupID = g.id
  ORDER BY t2.recordedTimestamp DESC, t2.id DESC
  LIMIT 1
);
This examined 3,009,254 rows and took ~0.859 seconds on 5.7.21, and slightly longer on 8.0.4-rc.
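How the "rows examined" figures were collected is not stated. One way to get a comparable number, assuming the Performance Schema is enabled with its default statement-history consumers, is to read it back for the current connection right after running a query. This is only a sketch of one possible measurement, not necessarily the method used for the numbers in this answer.

```sql
-- Show the most recent statements run by this connection,
-- with the number of rows each one examined.
SELECT
  SQL_TEXT,
  ROWS_EXAMINED,
  ROWS_SENT,
  TIMER_WAIT / 1e12 AS seconds  -- TIMER_WAIT is reported in picoseconds
FROM performance_schema.events_statements_history
WHERE THREAD_ID = (
  SELECT THREAD_ID
  FROM performance_schema.threads
  WHERE PROCESSLIST_ID = CONNECTION_ID()
)
ORDER BY EVENT_ID DESC
LIMIT 5;
```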
Example 2
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM temperature t1
INNER JOIN (
  SELECT max(t2.id) AS id
  FROM temperature t2
  INNER JOIN (
    SELECT t3.groupID, max(t3.recordedTimestamp) AS recordedTimestamp
    FROM selected_group g
    INNER JOIN temperature t3 ON t3.groupID = g.id
    GROUP BY t3.groupID
  ) t4 ON t4.groupID = t2.groupID AND t4.recordedTimestamp = t2.recordedTimestamp
  GROUP BY t2.groupID
) t5 ON t5.id = t1.id;
This examined 1,505,331 rows and took ~1.25 seconds on 5.7.21, and slightly longer on 8.0.4-rc.
Example 3
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM temperature t1
WHERE t1.id IN (
  SELECT max(t2.id) AS id
  FROM temperature t2
  INNER JOIN (
    SELECT t3.groupID, max(t3.recordedTimestamp) AS recordedTimestamp
    FROM selected_group g
    INNER JOIN temperature t3 ON t3.groupID = g.id
    GROUP BY t3.groupID
  ) t4 ON t4.groupID = t2.groupID AND t4.recordedTimestamp = t2.recordedTimestamp
  GROUP BY t2.groupID
)
ORDER BY t1.groupID;
This examined 3,009,685 rows and took ~1.95 seconds on 5.7.21, and slightly longer on 8.0.4-rc.
Example 4
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.id = (
  SELECT max(t2.id)
  FROM temperature t2
  WHERE t2.groupID = g.id AND t2.recordedTimestamp = (
    SELECT max(t3.recordedTimestamp)
    FROM temperature t3
    WHERE t3.groupID = g.id
  )
);
This examined 6,137,810 rows and took ~2.2 seconds on 5.7.21, and slightly longer on 8.0.4-rc.
Example 5
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM (
  SELECT
    t2.id,
    t2.groupID,
    t2.recordedTimestamp,
    t2.recordedValue,
    row_number() OVER (
      PARTITION BY t2.groupID ORDER BY t2.recordedTimestamp DESC, t2.id DESC
    ) AS rowNumber
  FROM selected_group g
  INNER JOIN temperature t2 ON t2.groupID = g.id
) t1 WHERE t1.rowNumber = 1;
This examined 6,017,808 rows and took ~4.2 seconds on 8.0.4-rc.
Example 6
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM (
  SELECT
    last_value(t2.id) OVER w AS id,
    t2.groupID,
    last_value(t2.recordedTimestamp) OVER w AS recordedTimestamp,
    last_value(t2.recordedValue) OVER w AS recordedValue
  FROM selected_group g
  INNER JOIN temperature t2 ON t2.groupID = g.id
  WINDOW w AS (
    PARTITION BY t2.groupID
    ORDER BY t2.recordedTimestamp, t2.id
    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  )
) t1
GROUP BY t1.groupID;
This examined 6,017,908 rows and took ~17.5 seconds on 8.0.4-rc.
Example 7
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.groupID = g.id
LEFT JOIN temperature t2
  ON t2.groupID = g.id
  AND (
    t2.recordedTimestamp > t1.recordedTimestamp
    OR (t2.recordedTimestamp = t1.recordedTimestamp AND t2.id > t1.id)
  )
WHERE t2.id IS NULL
ORDER BY t1.groupID;
This one was taking forever, so I had to kill it.
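When trying a query like this, a time limit on the statement itself avoids having to kill it by hand. A sketch using the MAX_EXECUTION_TIME optimizer hint, which applies to SELECT statements in MySQL 5.7.8 and later; the 60-second limit is an arbitrary value picked for illustration.

```sql
-- Same query as Example 7, but MySQL aborts it with an error
-- if it runs longer than 60000 milliseconds.
SELECT /*+ MAX_EXECUTION_TIME(60000) */
  t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.groupID = g.id
LEFT JOIN temperature t2
  ON t2.groupID = g.id
  AND (
    t2.recordedTimestamp > t1.recordedTimestamp
    OR (t2.recordedTimestamp = t1.recordedTimestamp AND t2.id > t1.id)
  )
WHERE t2.id IS NULL
ORDER BY t1.groupID;
```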