[Machine Learning] 100 Python Machine Learning Tips to Get You Up to Speed with ML

机器学习初学者 • 1 week ago • 30 views

Quick reference: use these code snippets to streamline the entire machine learning workflow.

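Building machine learning models is a key part of data science: it means applying algorithms to make predictions from data or to uncover patterns in it.

This article shares a series of concise code snippets covering every stage of the machine learning process, from data preparation and model selection to model evaluation and hyperparameter tuning. The examples help you complete common machine learning tasks with libraries such as Scikit-Learn, XGBoost, CatBoost, and LightGBM, and also touch on advanced techniques such as hyperparameter optimization with Hyperopt and model interpretation with SHAP values.

With these quick-reference snippets, you can streamline your machine learning workflow and build effective predictive models across different domains.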

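I. Data Handling and Exploration

  1. Load a dataset: data = pd.read_csv('dataset.csv')
  2. Explore the data: data.head(); data.info(); data.describe()
  3. Handle missing values: data.dropna() or data.fillna(value)
  4. Encode categorical variables: pd.get_dummies(data)
  5. Split the data into training and test sets: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  6. Scale features (fit on the training set only, so no test-set statistics leak in): scaler = StandardScaler(); X_train_scaled = scaler.fit_transform(X_train); X_test_scaled = scaler.transform(X_test)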
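
The steps above can be chained into one small script. This is a minimal sketch, not a definitive recipe: the toy DataFrame is a hypothetical stand-in for dataset.csv, and the column names (age, city, target) and the median fill are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for pd.read_csv('dataset.csv').
data = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 37, 50, 23, 44, 35],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B", "A", "C"],
    "target": [0, 1, 0, 1, 0, 1, 1, 0, 1, 0],
})

# Fill the single missing numeric value with the column median (an assumed strategy).
data["age"] = data["age"].fillna(data["age"].median())

# One-hot encode the categorical column; dtype=float keeps every feature numeric.
data = pd.get_dummies(data, columns=["city"], dtype=float)

X = data.drop(columns="target")
y = data["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on X_train alone means the test split is transformed with training statistics, which mirrors how the model will see new data.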

II. Model Initialization, Training, and Evaluation

  1. Initialize a model: model = RandomForestClassifier()
  2. Train the model: model.fit(X_train, y_train)
  3. Make predictions: predictions = model.predict(X_test)
  4. Evaluate accuracy: accuracy_score(y_test, predictions)
  5. Confusion matrix: conf_matrix = confusion_matrix(y_test, predictions)
  6. Classification report: class_report = classification_report(y_test, predictions)
  7. Cross-validation: cv_scores = cross_val_score(model, X, y, cv=5)
  8. Hyperparameter tuning (param_grid is a dict you define): grid_search = GridSearchCV(model, param_grid, cv=5); grid_search.fit(X_train, y_train)
  9. Feature importance: feature_importance = model.feature_importances_
  10. Save the model: joblib.dump(model, 'model.pkl')
  11. Load the model: loaded_model = joblib.load('model.pkl')
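
As a rough end-to-end illustration of the list above, the sketch below trains, evaluates, tunes, saves, and reloads a random forest. The synthetic dataset from make_classification, the small param_grid, and the temporary file location are all assumptions chosen to keep the example quick to run.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize, train, and predict.
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Evaluate on the held-out test set and with 5-fold cross-validation.
accuracy = accuracy_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Small, illustrative grid; real grids are usually wider.
param_grid = {"max_depth": [3, None], "n_estimators": [25, 50]}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=3
)
grid_search.fit(X_train, y_train)

# Save the model to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, model_path)
    loaded_model = joblib.load(model_path)

reloaded_predictions = loaded_model.predict(X_test)
print(f"accuracy={accuracy:.3f}, best params={grid_search.best_params_}")
```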

III. Dimensionality Reduction and Clustering

  1. Principal component analysis: pca = PCA(n_components=2); X_pca = pca.fit_transform(X)
  2. Dimensionality reduction: X_pca = PCA(n_components=2).fit_transform(X)
  3. K-means clustering: kmeans = KMeans(n_clusters=3); kmeans.fit(X); labels = kmeans.labels_
  4. Elbow method: sum_of_squared_distances = [KMeans(n_clusters=k).fit(X).inertia_ for k in range(1, 11)]
  5. Silhouette score: silhouette_avg = silhouette_score(X, labels)
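
The snippets above fit together roughly as follows. This is a sketch under assumptions: the data comes from make_blobs with three well-separated centers, and n_init and random_state are set explicitly only to make runs repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Synthetic data with three clusters in five dimensions (an assumption).
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=42)

# Project to two principal components, e.g. for plotting.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Elbow method: record the within-cluster sum of squares for k = 1..10.
inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# Cluster with the chosen k and score the result.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
silhouette_avg = silhouette_score(X, labels)
print(f"silhouette={silhouette_avg:.3f}")
```

Plotting inertias against k and looking for the bend in the curve is the usual way to read the elbow method; the silhouette score gives a second opinion on the chosen k.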

IV. Classification and Regression Models

  1. Decision tree: dt_model = DecisionTreeClassifier(); dt_model.fit(X_train, y_train)
  2. Support vector machine: svm_model = SVC(); svm_model.fit(X_train, y_train)
  3. Naive Bayes: nb_model = GaussianNB(); nb_model.fit(X_train, y_train)
  4. K-nearest neighbors classification: knn_model = KNeighborsClassifier(); knn_model.fit(X_train, y_train)
  5. K-nearest neighbors regression: KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
  6. Logistic regression: logreg_model = LogisticRegression(); logreg_model.fit(X_train, y_train)
  7. Ridge regression: ridge_model = Ridge(); ridge_model.fit(X_train, y_train)
  8. Lasso regression: lasso_model = Lasso(); lasso_model.fit(X_train, y_train)
  9. Ensemble (voting; soft voting needs base classifiers that implement predict_proba): ensemble_model = VotingClassifier(estimators=[('clf1', clf1), ('clf2', clf2)], voting='soft'); ensemble_model.fit(X_train, y_train)
  10. Bagging (scikit-learn 1.2+ names this parameter estimator; older versions call it base_estimator): bagging_model = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100); bagging_model.fit(X_train, y_train)
  11. Random forest: rf_model = RandomForestClassifier(n_estimators=100); rf_model.fit(X_train, y_train)
  12. Gradient boosting: gb_model = GradientBoostingClassifier(); gb_model.fit(X_train, y_train)
  13. AdaBoost: adaboost_model = AdaBoostClassifier(); adaboost_model.fit(X_train, y_train)
  14. XGBoost: xgb_model = xgb.XGBClassifier(); xgb_model.fit(X_train, y_train)
  15. LightGBM: lgb_model = lgb.LGBMClassifier(); lgb_model.fit(X_train, y_train)
  16. CatBoost: catboost_model = CatBoostClassifier(); catboost_model.fit(X_train, y_train)
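
Because these estimators share the same fit/predict interface, several of them can be compared in one loop. The sketch below uses only scikit-learn classifiers on synthetic data; the dataset, the chosen subset of models, and the dictionary keys are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logreg": LogisticRegression(max_iter=1000),
}

# Fit each model and record its test-set accuracy.
scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)

# Soft voting averages predicted probabilities, so every base model
# must implement predict_proba.
ensemble_model = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("naive_bayes", GaussianNB()),
        ("decision_tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="soft",
)
ensemble_model.fit(X_train, y_train)
scores["soft_voting"] = ensemble_model.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```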

V. Model Evaluation Metrics

  1. ROC curve (with predictions_prob = model.predict_proba(X_test)): fpr, tpr, thresholds = roc_curve(y_test, predictions_prob[:, 1])
  2. Area under the ROC curve: roc_auc = roc_auc_score(y_test, predictions_prob[:, 1])
  3. Precision-recall curve: precision, recall, thresholds = precision_recall_curve(y_test, predictions_prob[:, 1])
  4. Area under the precision-recall curve: pr_auc = auc(recall, precision)
  5. F1 score: f1 = f1_score(y_test, predictions)
  6. ROC AUC (receiver operating characteristic): roc_auc = roc_auc_score(y_test, predictions_prob[:, 1])
  7. Mean squared error: mse = mean_squared_error(y_test, predictions)
  8. Coefficient of determination (R²): r2 = r2_score(y_test, predictions)
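
The classification metrics above all depend on predictions_prob, which the list does not define. The sketch below assumes it is the output of predict_proba on a fitted classifier; the logistic regression model and synthetic data are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    auc,
    f1_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predictions = model.predict(X_test)

# Column 1 holds the predicted probability of the positive class.
predictions_prob = model.predict_proba(X_test)

fpr, tpr, roc_thresholds = roc_curve(y_test, predictions_prob[:, 1])
roc_auc = roc_auc_score(y_test, predictions_prob[:, 1])

precision, recall, pr_thresholds = precision_recall_curve(
    y_test, predictions_prob[:, 1]
)
pr_auc = auc(recall, precision)

f1 = f1_score(y_test, predictions)
print(f"ROC AUC={roc_auc:.3f}, PR AUC={pr_auc:.3f}, F1={f1:.3f}")
```

Note that the curve-based metrics take probabilities, while F1 takes hard class predictions.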

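VI. Cross-Validation and Sampling Techniques

  1. Stratified sampling: stratified_kfold = StratifiedKFold(n_splits=5)
  2. Time series split: time_series_split = TimeSeriesSplit(n_splits=5)
  3. Resampling (undersampling, from imbalanced-learn): rus = RandomUnderSampler(); X_resampled, y_resampled = rus.fit_resample(X, y)
  4. Resampling (oversampling, from imbalanced-learn): ros = RandomOverSampler(); X_resampled, y_resampled = ros.fit_resample(X, y)
  5. SMOTE (Synthetic Minority Over-sampling Technique, from imbalanced-learn): smote = SMOTE(); X_resampled, y_resampled = smote.fit_resample(X, y)
  6. Class weights (passed to the estimator's constructor): class_weight='balanced'
  7. Stratified sampling in cross-validation: stratified_cv = StratifiedKFold(n_splits=5); cv_scores = cross_val_score(model, X, y, cv=stratified_cv)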
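
To see what the two splitters above actually guarantee, the following sketch inspects their folds directly. The 90/10 class imbalance and the toy index-valued feature are assumptions; the imbalanced-learn resamplers are left out so the example depends only on scikit-learn and NumPy.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# 100 samples with a 90/10 class imbalance (an assumption for illustration).
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# StratifiedKFold keeps the class ratio (about 10% positives) in every fold.
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
positives_per_fold = [
    int(y[test_idx].sum()) for _, test_idx in stratified_kfold.split(X, y)
]

# TimeSeriesSplit never trains on samples that come after the test window.
time_series_split = TimeSeriesSplit(n_splits=5)
splits_in_order = [
    bool(train_idx.max() < test_idx.min())
    for train_idx, test_idx in time_series_split.split(X)
]

print(positives_per_fold)
print(splits_in_order)
```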

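VII. Feature Engineering and Transformation

  1. Learning curve: plot_learning_curve(model, X, y), a user-defined plotting helper (scikit-learn provides sklearn.model_selection.learning_curve to compute the underlying scores)
  2. Validation curve: plot_validation_curve(model, X, y, param_name='param', param_range=param_range), likewise a user-defined helper (see sklearn.model_selection.validation_curve)
  3. Early stopping (XGBoost as an example; requires a validation set passed via eval_set): early_stopping_rounds=10
  4. Feature scaling: scaler = MinMaxScaler(feature_range=(0, 1)); X_scaled = scaler.fit_transform(X)
  5. One-hot encoding: data_encoded = pd.get_dummies(data)
  6. Label encoding: label_encoder = LabelEncoder(); data['label_encoded'] = label_encoder.fit_transform(data['label'])
  7. Standardization (zero mean, unit variance): scaler = StandardScaler(); X_standardized = scaler.fit_transform(X)
  8. Normalization (rescale to the [0, 1] range): scaler = MinMaxScaler(); X_normalized = scaler.fit_transform(X)
  9. Data transformation (log transform; values must be greater than -1): X_transformed = np.log1p(X)
  10. Outlier detection (fit_predict returns -1 for outliers and 1 for inliers): iso_forest = IsolationForest(); outliers = iso_forest.fit_predict(X)
  11. Anomaly detection: envelope = EllipticEnvelope(contamination=0.01); outliers = envelope.fit_predict(X)
  12. Data imputation: imputer = SimpleImputer(strategy='mean'); X_imputed = imputer.fit_transform(X)
  13. Polynomial features (for polynomial regression): poly = PolynomialFeatures(degree=2); X_poly = poly.fit_transform(X)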
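
The difference between several of these transforms is easiest to see on a tiny array. The sketch below is illustrative only: the 4x2 matrix and its single missing value are made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, StandardScaler

# Tiny made-up feature matrix with one missing value.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])

# Mean imputation replaces NaN with the column mean of the observed values.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

# Standardization: each column ends up with mean 0 and unit variance.
X_standardized = StandardScaler().fit_transform(X_imputed)

# Min-max normalization: each column is rescaled to the [0, 1] range.
X_normalized = MinMaxScaler().fit_transform(X_imputed)

# Log transform compresses large values; inputs must be greater than -1.
X_log = np.log1p(X_imputed)

# Degree-2 polynomial features of (a, b): 1, a, b, a^2, a*b, b^2.
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_imputed)

print(X_imputed[1, 1], X_poly.shape)
```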

VIII. Regression Models and Techniques

  1. L1 regularization: lasso = Lasso(alpha=1.0); lasso.fit(X_train, y_train)
  2. L2 regularization: ridge = Ridge(alpha=1.0); ridge.fit(X_train, y_train)
  3. Huber regression: huber = HuberRegressor(); huber.fit(X_train, y_train)
  4. Quantile regression (statsmodels; q=0.5 fits the median): quantile_reg = QuantReg(y_train, X_train); quantile_result = quantile_reg.fit(q=0.5)
  5. Robust regression: ransac = RANSACRegressor(); ransac.fit(X_train, y_train)
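
To illustrate why the robust estimators above are useful, the sketch below fits ordinary least squares, Huber, and RANSAC to a line with a true slope of 3 after corrupting 10% of the targets. The data-generating process and the size of the outliers are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression, RANSACRegressor

# Synthetic line y = 3x plus noise (an assumption for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
X_train = x.reshape(-1, 1)
y_train = 3.0 * x + rng.normal(scale=1.0, size=x.shape)

# Corrupt the last 20 targets with a large positive offset.
y_train[-20:] += 50.0

ols = LinearRegression().fit(X_train, y_train)
huber = HuberRegressor().fit(X_train, y_train)
ransac = RANSACRegressor(random_state=0).fit(X_train, y_train)

# RANSAC exposes its final inlier-only fit as estimator_.
slopes = {
    "ols": float(ols.coef_[0]),
    "huber": float(huber.coef_[0]),
    "ransac": float(ransac.estimator_.coef_[0]),
}

for name, slope in slopes.items():
    print(f"{name}: slope={slope:.2f}")
```

On data like this, the least-squares slope is pulled toward the outliers, while the Huber and RANSAC fits stay closer to the true slope.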

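IX. Automated Machine Learning and Advanced Techniques

  1. Automated machine learning with TPOT: tpot = TPOTClassifier(); tpot.fit(X_train, y_train)
  2. Automated machine learning with H2O (train is an H2OFrame that contains the 'target' column): h2o_automl = H2OAutoML(max_models=10, seed=1); h2o_automl.train(x=list(X_train.columns), y='target', training_frame=train)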

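X. Plotting and Visualization

  1. Save a plot: plt.savefig('plot.png')
  2. Plot feature importances: plot_feature_importance(model), a user-defined helper, e.g. one that wraps plt.barh(feature_names, model.feature_importances_)
  3. Visualize K-means clusters (X is a NumPy array): plt.scatter(X[:, 0], X[:, 1], c=KMeans(n_clusters=3).fit_predict(X), cmap='viridis')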

XI. Miscellaneous

  1. Cross-validated predictions: cv_predictions = cross_val_predict(model, X, y, cv=5)
  2. Custom evaluation metric (custom_metric is a function you define): score = custom_metric(y_true, y_pred)
  3. Feature selection with scikit-learn (chi2 requires non-negative features): kbest = SelectKBest(chi2, k=5); X_selected = kbest.fit_transform(X, y)
  4. Recursive feature elimination with cross-validation: rfecv = RFECV(estimator=DecisionTreeClassifier(), step=1, cv=5); X_rfecv = rfecv.fit_transform(X, y)
  5. Degree of polynomial regression: poly = PolynomialFeatures(degree=2); X_poly = poly.fit_transform(X)
  6. Handling class imbalance: class_weight='balanced'
  7. Learning rate in AdaBoost: learning_rate=0.1
  8. Random seed for reproducibility: random_state=42
  9. The alpha parameter of ridge regression: ridge = Ridge(alpha=1.0); ridge.fit(X_train, y_train)
  10. The alpha parameter of lasso regression: lasso = Lasso(alpha=1.0); lasso.fit(X_train, y_train)
  11. Maximum depth of a decision tree: dt_model = DecisionTreeClassifier(max_depth=3); dt_model.fit(X_train, y_train)
  12. The n_neighbors parameter of K-nearest neighbors: knn_model = KNeighborsClassifier(n_neighbors=5); knn_model.fit(X_train, y_train)
  13. The kernel parameter of a support vector machine: svm_model = SVC(kernel='rbf'); svm_model.fit(X_train, y_train)
  14. Number of estimators in a random forest: rf_model = RandomForestClassifier(n_estimators=100); rf_model.fit(X_train, y_train)
  15. Learning rate of gradient boosting: gb_model = GradientBoostingClassifier(learning_rate=0.1); gb_model.fit(X_train, y_train)
  16. Huber regression with grid search: GridSearchCV(HuberRegressor(), {'epsilon': [1.1, 1.2, 1.3]}, cv=5).fit(X_train, y_train)
  17. Ridge regression with cross-validation: RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5).fit(X_train, y_train)
  18. Model stacking (mlxtend's StackingClassifier; scikit-learn's class of the same name takes estimators= and final_estimator= instead): stacked_model = StackingClassifier(classifiers=[clf1, clf2], meta_classifier=meta_clf); stacked_model.fit(X_train, y_train)
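
A few of the items above (feature selection, cross-validated predictions, and a custom metric) can be combined as in the sketch below. The synthetic data, the MinMaxScaler step that makes the features non-negative for chi2, and the body of custom_metric are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, SelectKBest, chi2
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X_raw, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=42
)

# chi2 only accepts non-negative features, so rescale to [0, 1] first.
X = MinMaxScaler().fit_transform(X_raw)

# Keep the five features with the highest chi-squared statistic.
kbest = SelectKBest(chi2, k=5)
X_selected = kbest.fit_transform(X, y)

# Recursive feature elimination picks the feature count by cross-validation.
rfecv = RFECV(estimator=DecisionTreeClassifier(random_state=42), step=1, cv=5)
X_rfecv = rfecv.fit_transform(X, y)

# Out-of-fold predictions: each sample is predicted by a model that never saw it.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
cv_predictions = cross_val_predict(model, X, y, cv=5)


def custom_metric(y_true, y_pred):
    """Hypothetical metric: fraction of positive samples predicted as positive."""
    positives = y_true == 1
    return float((y_pred[positives] == 1).mean())


custom_score = custom_metric(y, cv_predictions)

# make_scorer lets the same function drive cross_val_score or GridSearchCV.
custom_cv_scores = cross_val_score(
    model, X, y, cv=5, scoring=make_scorer(custom_metric)
)
print(X_selected.shape, X_rfecv.shape, round(custom_score, 3))
```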

Article URL: http://www.python88.com/topic/182537