Auto-Sklearn was developed by Matthias Feurer et al. and described in their 2015 paper "Efficient and Robust Automated Machine Learning" [1].
… we introduce a robust new AutoML system based on scikit-learn (using 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters)
This system, which we dub AUTO-SKLEARN, improves on existing AutoML methods by automatically taking into account past performance on similar datasets, and by constructing ensembles from the models evaluated during the optimization.
This meta-learning step complements the Bayesian optimization used to tune the ML pipeline. For a hyperparameter space as large as an entire ML framework, Bayesian optimization is slow to get going, so auto-sklearn seeds it with a handful of configurations chosen by meta-learning from past performance on similar datasets; this can be described as warm-starting the optimization. Combined with automatic ensembling of the models evaluated during the search, it makes the whole machine learning pipeline highly automated and saves the user a great deal of time, leaving practitioners free to spend more of it on selecting data and thinking about the problem itself.
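Both behaviours map onto constructor arguments of the estimators. The sketch below is illustrative only, assuming an auto-sklearn release in which initial_configurations_via_metalearning and ensemble_size are still accepted directly by AutoSklearnClassifier:

# sketch: controlling the meta-learning warm start and the ensemble
# (argument values are illustrative, not taken from the text above)
from autosklearn.classification import AutoSklearnClassifier

model = AutoSklearnClassifier(
    time_left_for_this_task=5*60,               # total budget for the search, in seconds
    per_run_time_limit=30,                      # budget for a single pipeline evaluation
    initial_configurations_via_metalearning=25, # meta-learned configurations used to seed the optimizer
    ensemble_size=50,                           # number of models kept in the final ensemble
)

Setting initial_configurations_via_metalearning=0 disables the warm start and falls back to cold-started Bayesian optimization; setting ensemble_size=1 keeps only the single best model instead of building an ensemble.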
# summarize the sonar dataset
from pandas import read_csv
# load dataset ('sonar.csv' is a placeholder; replace with the path or URL of your copy of the data)
data = 'sonar.csv'
dataframe = read_csv(data, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
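The next snippet evaluates a fitted model on a held-out test set, but the excerpt above stops at loading the data. A minimal sketch of the intermediate step, with illustrative time budgets mirroring the regression example later in this section, might look like this:

# sketch: define and run the search on a train/test split (budgets are illustrative)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from autosklearn.classification import AutoSklearnClassifier

# minimally prepare the data: numeric inputs, integer-encoded class labels
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# hold out a test set for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define the search: 5 minutes in total, 30 seconds per candidate pipeline
model = AutoSklearnClassifier(time_left_for_this_task=5*60, per_run_time_limit=30, n_jobs=8)
# perform the search on the training data
model.fit(X_train, y_train)
# summarize what the search did
print(model.sprint_statistics())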
# evaluate best model
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Accuracy: %.3f" % acc)
We can also print the final score of the ensemble and plot a confusion matrix.
# score of the final ensemble
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

y_pred = model.predict(X_test)
m1_acc_score = accuracy_score(y_test, y_pred)
print(m1_acc_score)
# confusion matrix of the ensemble's predictions, drawn as a heatmap
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True)
plt.show()
At the end of the run a summary is printed, showing that 1,054 models were evaluated and that the estimated performance of the final model is about 91%.
auto-sklearn results:
  Dataset name: f4c282bd4b56d4db7e5f7fe1a6a8edeb
  Metric: accuracy
  Best validation score: 0.913043
  Number of target algorithm runs: 1054
  Number of successful target algorithm runs: 952
  Number of crashed target algorithm runs: 94
  Number of target algorithms that exceeded the time limit: 8
  Number of target algorithms that exceeded the memory limit: 0
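To see which pipelines the search actually kept, the fitted estimator can be queried directly. A brief sketch using methods available on AutoSklearnClassifier (leaderboard() only in more recent releases):

# inspect the outcome of the search
# show_models() describes the pipelines in the final ensemble and their weights
print(model.show_models())
# leaderboard() ranks every evaluated model (available in more recent auto-sklearn releases)
print(model.leaderboard())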
By default, the regressor optimizes the R^2 metric. To use mean absolute error (MAE) instead, it can be specified via the metric argument when calling the fit() function.
# example of auto-sklearn for the insurance regression dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from autosklearn.regression import AutoSklearnRegressor
from autosklearn.metrics import mean_absolute_error as auto_mean_absolute_error
# load dataset ('insurance.csv' is a placeholder; replace with the path or URL of your copy of the data)
data = 'insurance.csv'
dataframe = read_csv(data, header=None)
# split into input and output elements
data = dataframe.values
data = data.astype('float32')
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define search
model = AutoSklearnRegressor(time_left_for_this_task=5*60, per_run_time_limit=30, n_jobs=8)
# perform the search
model.fit(X_train, y_train, metric=auto_mean_absolute_error)
# summarize
print(model.sprint_statistics())
# evaluate best model
y_hat = model.predict(X_test)
mae = mean_absolute_error(y_test, y_hat)
print("MAE: %.3f" % mae)
auto-sklearn results:
  Dataset name: ff51291d93f33237099d48c48ee0f9ad
  Metric: mean_absolute_error
  Best validation score: 29.911203
  Number of target algorithm runs: 1759
  Number of successful target algorithm runs: 1362
  Number of crashed target algorithm runs: 394
  Number of target algorithms that exceeded the time limit: 3
  Number of target algorithms that exceeded the memory limit: 0