[Machine Learning] 10 GitHub Data Science Projects with Source Code

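As of 2023, the world has generated more than 120 ZB of data, far more than most of us can picture. More striking still, that figure is expected to pass 180 ZB within the next two years. This growth is why data science is expanding so quickly and why it needs skilled professionals who enjoy working with data.

If you are considering a data-focused career, one of the best ways in is to work on GitHub data science projects and build a data scientist portfolio that shows your skills and experience.

So if you are passionate about data science and eager to explore new datasets and techniques, read on for ten data science projects you can contribute to.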

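Exploring the Enron Email Dataset
Predicting House Prices with Machine Learning
Identifying Fraudulent Credit Card Transactions
Image Classification with Convolutional Neural Networks
Sentiment Analysis of Twitter Data
Analyzing Netflix Movies and TV Shows
Customer Segmentation with K-Means Clustering
Medical Diagnosis with Deep Learning
Music Genre Classification with Machine Learning
Predicting Credit Risk with Logistic Regression

Best Practices for Contributing to Data Science Projects on GitHub
How to Showcase Your Data Science Projects on GitHub
Conclusion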

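10 GitHub Data Science Projects for Beginners

1. Exploring the Enron Email Dataset

The first project explores the Enron email dataset and gives you a first look at standard data science tasks. Dataset link: https://github.com/Mithileysh/Email-Datasets

Problem Statement

This project explores a dataset of internal emails from Enron, a company known worldwide for the large-scale corporate fraud that led to its bankruptcy. You will look for patterns and classify the emails to try to detect fraudulent ones.

Brief Overview of the Project and the Enron Email Dataset

Start by getting to know the data. The dataset belongs to the Enron corpus, a large collection of more than 600,000 emails from Enron employees. It gives data scientists a chance to study one of the largest corporate fraud cases in depth.

Step-by-Step Guide

Clone the original repository and get familiar with the Enron dataset: review the data and any documentation provided, understand the data types, and keep track of the fields.
After this first look, move on to data preprocessing. Because the dataset is large, it contains a lot of noise (unnecessary elements) and needs cleaning. You may also need to handle missing values.
After preprocessing, perform EDA (exploratory data analysis). This may involve creating visualizations to better understand how the data is distributed.
You can also run statistical analyses to identify correlations between data elements, or anomalies.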
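The first exploration steps above can be sketched with Python's standard library alone. This is a minimal sketch, not the method of any linked repository: the two messages in RAW_EMAILS are made up, and it assumes each raw email is plain text in the usual header/body layout.

```python
from collections import Counter
from email import message_from_string

# Two made-up messages in the header/body layout of raw email files (hypothetical data).
RAW_EMAILS = [
    "From: alice@example.com\nTo: bob@example.com\nSubject: Q3 numbers\n\nPlease review the attached figures.",
    "From: alice@example.com\nTo: carol@example.com\nSubject: Meeting\n\nCan we meet on Friday?",
]

def parse_email(raw):
    """Return sender, recipient, subject and body of one raw message."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }

parsed = [parse_email(raw) for raw in RAW_EMAILS]

# A first exploratory question: who sends the most mail?
sender_counts = Counter(p["from"] for p in parsed)
print(sender_counts.most_common(1))
```

Once every message is a dictionary like this, the records can be loaded into a DataFrame for cleaning and visualization.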

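Here are some related GitHub repositories that can help you study the Enron email dataset:

Fraud detection on Enron corpus data: https://github.com/geekquad/Fraud-Detection
Exploratory analysis and classification of the Enron dataset: https://github.com/ManasviGoyal/Enron-Classification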


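2. Predicting House Prices with Machine Learning

House price prediction is one of the most popular data analyst projects on GitHub.

Problem Statement

The goal of this project is to predict house prices from several factors and to study how those factors relate to each other. When you finish, you will be able to explain how each factor affects the price.

Project Overview and the House Price Dataset

Here you will use a dataset with more than 13 features, including Id (to count records), zoning, area (lot size in square feet), building type (type of dwelling), year built, year remodeled (if applicable), sale price (the value to predict), and a few others.

Dataset link: https://docs.google.com/spreadsheets/d/1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs/edit?usp=sharing&ouid=115253717745408081083&rtpof=true&sd=true

Step-by-Step Guide

You will follow this process for the machine learning project.

As with any other GitHub project, start by exploring the dataset's data types, relationships, and anomalies.
Next, preprocess the data as your requirements dictate: reduce noise and fill in missing values (or drop the affected rows).
Because house prices depend on many features, feature engineering is essential. This may include creating new variables from combinations of existing ones and selecting the most appropriate variables.
Then choose the most suitable ML model by comparing several, such as linear regression, decision trees, and neural networks.
Finally, evaluate the chosen model with metrics such as root mean squared error (RMSE) and R-squared to understand how well it performs.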
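To make the two evaluation metrics in the last step concrete, here is a minimal sketch that computes them by hand. The prices in actual and predicted are made-up illustration values, not results from the dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: the typical size of a prediction error, in price units."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions explain."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical sale prices and model predictions
actual = [200000, 150000, 320000, 275000]
predicted = [210000, 140000, 300000, 280000]

print("RMSE:", rmse(actual, predicted))  # 12500.0
print("R^2:", round(r_squared(actual, predicted), 3))
```

RMSE is expressed in the same unit as the target, so it reads directly as a typical price error, while R-squared is unitless and easier to compare across datasets.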

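Here are some related GitHub repositories that can help you predict house prices:

House price prediction with regularized linear regression: https://github.com/SouravG/Housing-price-prediction-using-Regularised-linear-regression
Advanced regression techniques for house price prediction: https://github.com/tatha04/Housing-Prices-Advanced-Regression-Techniques

Code snippet: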

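import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the data and drop the categorical columns that are not encoded here
housing_df = pd.read_csv('housing_data.csv')
housing_df = housing_df.drop(['MSZoning', 'LotConfig', 'BldgType', 'Exterior1st'], axis=1)

# Drop rows with missing values in the columns used below
housing_df = housing_df.dropna(subset=['BsmtFinSF2', 'TotalBsmtSF', 'SalePrice'])

# Separate the features from the target
X = housing_df.drop('SalePrice', axis=1)
y = housing_df['SalePrice']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a linear regression model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Evaluate on the held-out rows
y_pred = lr.predict(X_test)
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('R^2:', r2_score(y_test, y_pred))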

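3. Identifying Fraudulent Credit Card Transactions

Fraud detection in credit card transactions is an excellent area for practicing GitHub data science projects. It will make you proficient at spotting patterns and anomalies in data.

Problem Statement

This GitHub data science project aims to detect patterns in data that contains credit card transaction information. The result should give you the characteristics or patterns that fraudulent transactions have in common.

Brief Overview of the Project and Dataset

In this GitHub project, you can use any credit card transaction dataset, such as the one covering transactions made by European cardholders in September 2013. That dataset contains 492 fraudulent transactions out of 284,807 in total. The features are labeled V1, V2, and so on.

Dataset link: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud?select=creditcard.csv

Step-by-Step Guide

Start with data exploration to understand the structure, and use the Pandas library to check the dataset for missing values.
Once you are familiar with the dataset, preprocess the data: handle missing values, remove unnecessary variables, and create new features through feature engineering.
Next, train a machine learning model. Consider different algorithms, such as support vector machines, random forests, and regression, and tune them for the best results.
Evaluate performance with several metrics, such as recall, precision, and F1 score.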
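With only 492 frauds among 284,807 transactions, accuracy alone says very little, which is why the last step lists recall and F1 as well. The sketch below computes the metrics from confusion-matrix counts. The "flag nothing" baseline follows directly from the dataset totals; the second set of counts is hypothetical and not a measured result.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

TOTAL_TRANSACTIONS = 284807
FRAUD_TRANSACTIONS = 492

# Baseline: a "model" that labels every transaction as legitimate
baseline = classification_metrics(
    tp=0, fp=0, fn=FRAUD_TRANSACTIONS, tn=TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS
)

# Hypothetical classifier: catches 400 frauds and raises 100 false alarms
candidate = classification_metrics(
    tp=400, fp=100, fn=92, tn=TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS - 100
)

print(baseline)
print(candidate)
```

The baseline scores above 99.8% accuracy while catching no fraud at all, so recall and F1 are the numbers to watch on data this imbalanced.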

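Here are some related GitHub repositories that can help you detect fraudulent credit card transactions:

Fraud detection model for anonymized credit card transactions: https://github.com/sagnikghoshcr7/Credit-Card-Fraud-Detection
Fraud detection papers: https://github.com/benedekrozemberczki/awesome-fraud-detection-papers

Code snippet: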

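import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the transactions; 'Class' is 1 for fraud and 0 otherwise
creditcard_df = pd.read_csv('creditcard_data.csv')

X = creditcard_df.drop('Class', axis=1)
y = creditcard_df['Class']
# Stratify so both splits keep the same share of fraud cases
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Evaluate on the held-out transactions
y_pred = rf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 score:', f1_score(y_test, y_pred))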

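4. Image Classification with Convolutional Neural Networks

The next project on our list of GitHub data science projects focuses on image classification with CNNs (convolutional neural networks). A CNN is a type of neural network with built-in convolutional layers, which reduce the high dimensionality of images while keeping the information that matters for classification.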
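To see how convolution and pooling shrink an image while keeping its useful structure, here is a minimal pure-Python sketch. The 5x5 image and the 2x2 edge kernel are made up for illustration. As in most deep learning libraries, the "convolution" here slides the kernel without flipping it.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj] for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[i + di][j + dj] for di in range(size) for dj in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]

# A 5x5 "image" with a vertical edge between the second and third columns
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# A 2x2 kernel that responds to left-to-right increases in brightness
edge_kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d_valid(image, edge_kernel)  # 4x4 map, nonzero only at the edge
pooled = max_pool2d(feature_map)                # 2x2 summary of the feature map
print(pooled)  # [[2, 0], [2, 0]]
```

The 25 input pixels become 4 values, yet the pooled output still records that an edge sits on the left side of the image.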

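Problem Statement

This project uses a convolutional neural network to classify images based on their features. When you finish, you will understand how CNNs handle image datasets for classification.

Brief Overview of the Project and Dataset

In this project, you can build an image dataset from Bing by scraping image URLs for specific keywords. You will need Python and the multithreading support of the bing-images package: run pip install bing-images at the command prompt and import "bing" to fetch image URLs.

Step-by-Step Guide to Image Classification

Start by narrowing your search to the kind of images you want to classify. It can be anything, such as cats or dogs. Download the images in bulk using the multithreading feature.
Next comes data organization and preprocessing. Resize the images to a uniform size and convert them to grayscale if needed.
Split the dataset into training, validation, and test sets. The training set trains the CNN model, the validation set monitors the training process, and the test set is kept for the final evaluation.
Define the architecture of the CNN model. You can also add layers such as batch normalization, which stabilizes training, and dropout, which helps reduce overfitting.
Train the CNN model on the training set with a suitable optimizer such as Adam or SGD, and evaluate its performance.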

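Here are some related GitHub repositories that can help you classify images with a CNN:

Fetching images from Bing: https://github.com/CatchZeng/bing_images
Web application for trash classification using CNN-based image classification: https://github.com/vladalexey/webapp-trash-classification

Code snippet:
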
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Dropout
from keras.utils import to_categorical

# Load the dataset (CIFAR-10, as imported above; swap in your own image data)
(X_train, y_train), (X_test, y_test) = cifar10.load_data()

# Scale pixel values to the range [0, 1]
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

# One-hot encode target variables
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

# Define the model architecture
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=X_train.shape[1:]))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, batch_size=128, epochs=20, validation_data=(X_test, y_test))

# Evaluate the model on the test set
scores = model.evaluate(X_test, y_test, verbose=0)
print("Test Accuracy:", scores[1])

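5. Sentiment Analysis of Twitter Data

Problem Statement

It is important to understand the sentiment behind content posted online. This project uses NLP (natural language processing) to study and analyze the sentiment behind posts on Twitter, one of the most popular social networks.

Brief Overview of the Project and Dataset

In this GitHub data science project, you will collect Twitter data using the Streaming Twitter API, Python, MySQL, and Tweepy. You will then perform sentiment analysis to identify specific emotions and opinions. By monitoring these sentiments, you can help individuals or organizations make better decisions about customer engagement and experience. The project is approachable even for beginners.

You can use the Sentiment140 dataset, which contains more than 1.6 million tweets.

Dataset link: https://www.kaggle.com/datasets/kazanova/sentiment140?select=training.1600000.processed.noemoticon.csv

Step-by-Step Guide

The first step is to use Twitter's API to collect data based on specific keywords, users, or tweets. Once you have the data, remove unnecessary noise and other irrelevant elements such as special characters.
You can also remove stop words (words that add little value), such as "the" and "and". In addition, you can perform lemmatization, which converts the different forms of a word into a single form; for example, "eat", "eating", and "eats" all become "eat".
The next important step in NLP-based analysis is tokenization. Put simply, you break the text into smaller units, or tokens, such as individual words. This makes it easier to assign meaning to the small pieces that make up the whole text.
Once the data is tokenized, use a machine learning model to classify the sentiment of each tweet. You can use a random forest classifier, naive Bayes, or an RNN for this.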

Here are some related GitHub repositories that can help you analyze sentiment in Twitter data:

Real-time sentiment tracking on Twitter for brand improvement: https://github.com/Chulong-Li/Real-time-Sentiment-Tracking-on-Twitter-for-Brand-Improvement-and-Trend-Recognition/blob/master/sample_data.csv
Positive/negative sentiment analysis of Twitter data: https://github.com/the-javapocalypse/Twitter-Sentiment-Analysis

Code snippet:

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
import string
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Load the dataset
data = pd.read_csv('tweets.csv', encoding='latin-1', header=None)

# Assign new column names to the DataFrame
column_names = ['target', 'id', 'date', 'flag', 'user', 'text']
data.columns = column_names

# Preprocess the text data
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Remove URLs, usernames, and hashtags
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#\w+', '', text)

    # Remove punctuation and convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()

    # Tokenize the text and remove stop words
    tokens = word_tokenize(text)
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Join the tokens back into text
    preprocessed_text = ' '.join(lemmatized_tokens)
    return preprocessed_text

data['text'] = data['text'].apply(preprocess_text)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['target'], test_size=0.2, random_state=42)

# Vectorize the text data
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Train the model
clf = MultinomialNB().fit(X_train_tfidf, y_train)

# Test the model
X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
y_pred = clf.predict(X_test_tfidf)

# Print the classification report
print(classification_report(y_test, y_pred))


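6. Analyzing Netflix Movies and TV Shows

Netflix is probably everyone's favorite movie streaming service. This GitHub data science project is based on analyzing Netflix movies and TV shows.

Problem Statement

The goal of this project is to run a data analysis workflow on Netflix data, including EDA, visualization, and interpretation.

Brief Overview of the Project and Dataset

This data science project is meant to sharpen your skills at visualizing and interpreting Netflix data with libraries such as Matplotlib, Seaborn, and wordcloud, and with tools such as Tableau.

For this, you can use the Netflix Original Films & IMDB Scores dataset available on Kaggle. It contains all Netflix originals released as of June 1, 2021, along with their IMDb ratings.

Dataset link: https://www.kaggle.com/datasets/luiscorter/netflix-original-films-imdb-scores

Step-by-Step Guide to Analyzing Netflix Movies

After downloading the dataset, preprocess it by removing unnecessary noise and stop words such as "the", "an", and "and".
Then tokenize the cleaned data. This step breaks longer sentences or paragraphs into smaller units or individual words.
You can also use stemming or lemmatization to convert the different forms of a word into a single item. For example, "sleep" and "sleeping" both become "sleep".
After preprocessing and lemmatizing the data, extract features from the text with a count vectorizer, TF-IDF, or similar, and then classify the sentiment with a machine learning algorithm. You can use a random forest, an SVM, or an RNN for this.
Create visualizations and study patterns and trends, such as the number of films released per year or the most popular genres.
The project can be extended to text analysis of the titles, directors, and casts of the movies and TV shows.
You can use the resulting insights to build recommendations.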
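The trend questions in the steps above, such as titles per year or the most common genre, reduce to simple counting. Here is a minimal sketch using only the standard library. The rows in SAMPLE_CSV are made up, and the column names Title, Genre, and Premiere are assumptions about the file layout.

```python
import csv
import io
from collections import Counter

# Made-up rows; the column names are assumed, not taken from the real file.
SAMPLE_CSV = """Title,Genre,Premiere
Film A,Drama,2019
Film B,Comedy,2020
Film C,Drama,2020
Film D,Documentary,2020
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Count titles per release year and per genre
titles_per_year = Counter(row["Premiere"] for row in rows)
titles_per_genre = Counter(row["Genre"] for row in rows)

print(titles_per_year.most_common())
print(titles_per_genre.most_common(1))
```

The same counts can feed a Matplotlib or Seaborn bar chart once you are working with the full dataset.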

Here are some related GitHub repositories that can help you analyze Netflix movies and TV shows:

Exploratory data analysis of Netflix data: https://github.com/MelihGulum/Exploratory-Data-Analysis-EDA/blob/main/Netflix_Originals_EDA.ipynb
Netflix visualization and recommendation: https://github.com/NikosMav/DataAnalysis-Netflix

Code snippet:

import pandas as pd
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

# Load the Netflix dataset
netflix_data = pd.read_csv('netflix_titles.csv', encoding='iso-8859-1')

# Create a new column for sentiment scores of movie and TV show titles
sia = SentimentIntensityAnalyzer()
netflix_data['sentiment_scores'] = netflix_data['Title'].apply(lambda x: sia.polarity_scores(x))

# Extract the compound sentiment score from the sentiment scores dictionary
netflix_data['sentiment_score'] = netflix_data['sentiment_scores'].apply(lambda x: x['compound'])

# Group the data by language and calculate the average sentiment score for movies and TV shows in each language
language_sentiment = netflix_data.groupby('Language')['sentiment_score'].mean()

# Print the top 10 languages with the highest average sentiment score for movies and TV shows
print(language_sentiment.sort_values(ascending=False).head(10))


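7. Customer Segmentation with K-Means Clustering

Customer segmentation is one of the most important applications of data science. This GitHub data science project asks you to use the K-Means clustering algorithm, a popular unsupervised machine learning algorithm that groups data points into K clusters based on similarity.

Problem Statement

The goal of this project is to use K-Means clustering to segment customers visiting a shopping mall based on factors such as annual income and spending habits.

Brief Overview of the Project and Dataset

This project asks you to collect data, carry out preliminary research and data preprocessing, and train and test a K-Means clustering model to segment customers. You can use the Mall Customer Segmentation dataset, which has 5 features (customer ID, gender, age, annual income, and spending score) for 200 customers.

Dataset link: https://www.kaggle.com/nelakurthisudheer/mall-customer-segmentation

Step-by-Step Guide

Follow these steps:

Load the dataset, import all required packages, and explore the data.
Once you are familiar with the data, clean the dataset by removing duplicate or irrelevant data, handling missing values, and formatting the data for analysis.
Select the relevant features. These may include annual income, spending score, gender, and so on.
Train a K-Means clustering model on the preprocessed data to identify customer segments based on these features. You can then visualize the segments with Seaborn, using scatter plots, heatmaps, and similar charts.
Finally, analyze the segments to gain insight into customer behavior.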
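To show what the clustering step does under the hood, here is a minimal from-scratch sketch of K-Means (assign each point to its nearest centroid, then move each centroid to the mean of its points). The six customers are made up, and the k_means helper is illustrative only; in practice the scikit-learn implementation is the better choice.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=20, seed=42):
    """Plain K-Means: alternate between assigning points and updating centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest centroid
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Hypothetical customers as (annual income in k$, spending score)
customers = [(15, 80), (16, 82), (17, 78), (85, 20), (88, 18), (90, 22)]

centroids, labels = k_means(customers, k=2)
print(labels)
```

On this toy data the first three customers (low income, high spending) end up in one cluster and the last three in the other, whichever starting centroids are drawn.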

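Here are some related GitHub repositories that can help you segment customers:

Customer segmentation of mall customers: https://github.com/NelakurthiSudheer/Mall-Customers-Segmentation
Demonstration of the k-means algorithm on customer data: https://github.com/mayursrt/customer-segmentation-using-k-means

Code snippet: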

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

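# Load the customer data
customer_data = pd.read_csv('customer_data.csv')
# Drop the identifier and the non-numeric column before clustering
# (column names assume the Mall Customers layout described above)
customer_data = customer_data.drop(['CustomerID', 'Gender'], axis=1)
# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(customer_data)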

# Find the optimal number of clusters using the elbow method
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(scaled_data)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

# Perform K-Means clustering with the optimal number of clusters
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans.fit(scaled_data)

# Add the cluster labels to the original DataFrame
customer_data['Cluster'] = kmeans.labels_

# Plot the clusters based on age and income
plt.scatter(customer_data['Age'], customer_data['Annual Income (k$)'], c=customer_data['Cluster'])
plt.title('Customer Segmentation')
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()

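8. Medical Diagnosis with Deep Learning

Deep learning is a relatively new branch of machine learning built on multi-layer neural networks. Because of its modeling power, it is widely used in complex applications. Working on GitHub data science projects that involve deep learning is therefore a strong addition to your data analyst portfolio on GitHub.

Problem Statement

This GitHub data science project aims to identify different pathologies in chest X-rays using a deep convolutional model. When you finish, you should understand how deep learning and machine learning are used in radiology.

Brief Overview of the Project and Dataset

In this data science capstone project, you will use the GradCAM model interpretation method and chest X-rays to diagnose 14 pathologies, such as pneumothorax, edema, and cardiomegaly. The goal is to use a deep learning-based DenseNet-121 classification model.

We will use a public chest X-ray dataset containing 108,948 frontal-view X-ray images of 32,717 patients. A subset of about 1,000 images is enough for this project.

Dataset link: https://arxiv.org/abs/1705.02315

Step-by-Step Guide

Download the dataset. Once you have the data, preprocess it by resizing the images, normalizing the pixel values, and so on, so that it is ready for training.
Next, train the deep learning model, DenseNet121, using PyTorch or TensorFlow.
With the trained model, you can predict pathologies and other underlying issues, if any.
Evaluate your model with F1 score, precision, and accuracy. If trained properly, the model can reach an accuracy of up to 0.9 (the closer to 1, the better).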

Here are some related GitHub repositories that can help you with medical diagnosis using deep learning:

Chest X-ray diagnosis with DenseNet121: https://github.com/LaurentVeyssier/Chest-X-Ray-Medical-Diagnosis-with-Deep-Learning
Image-based COVID-19 diagnosis: https://github.com/jeremykohn/rid-covid

Code snippet:

import matplotlib.pyplot as plt  # needed for the accuracy and loss plots below
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Set up data generators for training and validation sets
train_datagen = ImageDataGenerator(rescale=1./255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True)
train_generator = train_datagen.flow_from_directory('train_dir', target_size=(128, 128), batch_size=32, class_mode='binary')
val_datagen = ImageDataGenerator(rescale=1./255)
val_generator = val_datagen.flow_from_directory('val_dir', target_size=(128, 128), batch_size=32, class_mode='binary')

# Build a convolutional neural network for medical diagnosis
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 3)))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model on the training set and evaluate it on the validation set
history = model.fit(train_generator, epochs=10, validation_data=val_generator)

# Plot the training and validation accuracy and loss curves
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

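9. Music Genre Classification with Machine Learning

This is one of the most interesting GitHub data science projects. It is also quite challenging, because the data is music.

Problem Statement

This GitHub project helps you learn to work with non-standard data types such as music. You will also learn how to classify such data based on different features.

Brief Overview of the Project and Dataset

In this project, you will collect music data and use it to train and test an ML model. Because music is heavily protected by copyright, it is easier to work with the MSD (Million Song Dataset).

This freely available dataset contains audio features and metadata for nearly a million songs. The songs span categories such as classical, disco, hip-hop, and reggae. The dataset does not include the audio itself, however, so you need a music provider platform to stream the actual sound.

Dataset link: http://millionsongdataset.com/

Step-by-Step Guide

The first step is to collect the music data.
The next step is to preprocess it. Music data is usually preprocessed by converting audio files into feature vectors that can be used as model input.
After processing the data, explore features such as frequency and pitch. You can study the data with Mel-frequency cepstral coefficients (MFCCs), rhythm features, and similar methods, and later use these features to classify songs.
Choose a suitable machine learning model. It can be a multi-class SVM or a CNN, depending on the size of the dataset and the accuracy you need.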

Here are some related GitHub repositories that can help you classify music genres:

Music classification: https://github.com/mlachmish/MusicGenreClassification
Music genre classification with an LSTM: https://github.com/ruohoruotsi/LSTM-Music-Genre-Classification

Code snippet:

import os
import librosa
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # needed for the accuracy and loss plots below
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from keras import models, layers

# Set up paths to audio files and genre labels
AUDIO_PATH = 'audio'
CSV_PATH = 'data.csv'

# Load audio files and extract features using librosa
def extract_features(file_path):
    audio_data, _ = librosa.load(file_path, sr=22050, mono=True, duration=30)
    mfccs = librosa.feature.mfcc(y=audio_data, sr=22050, n_mfcc=20)
    chroma_stft = librosa.feature.chroma_stft(y=audio_data, sr=22050)
    spectral_centroid = librosa.feature.spectral_centroid(y=audio_data, sr=22050)
    spectral_bandwidth = librosa.feature.spectral_bandwidth(y=audio_data, sr=22050)
    spectral_rolloff = librosa.feature.spectral_rolloff(y=audio_data, sr=22050)
    # np.hstack accepts the scalar means as well as the 1-D arrays (np.concatenate rejects 0-d inputs)
    features = np.hstack((np.mean(mfccs, axis=1), np.mean(chroma_stft, axis=1), np.mean(spectral_centroid), np.mean(spectral_bandwidth), np.mean(spectral_rolloff)))
    return features

# Load data from CSV file and extract features
data = pd.read_csv(CSV_PATH)
features = []
labels = []
for index, row in data.iterrows():
    file_path = os.path.join(AUDIO_PATH, row['filename'])
    genre = row['label']
    features.append(extract_features(file_path))
    labels.append(genre)

# Encode genre labels and scale features
encoder = LabelEncoder()
labels = encoder.fit_transform(labels)
scaler = StandardScaler()
features = scaler.fit_transform(np.array(features, dtype=float))

# Split data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.2)

# Build a neural network for music genre classification
model = models.Sequential()
model.add(layers.Dense(256, activation='relu', input_shape=(train_features.shape[1],)))
model.add(layers.Dropout(0.3))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dropout(0.1))
model.add(layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model on the training set and evaluate it on the testing set
history = model.fit(train_features, train_labels, epochs=50, batch_size=128, validation_data=(test_features, test_labels))

# Plot the training and testing accuracy and loss curves
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Testing Accuracy')
plt.title('Training and Testing Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Testing Loss')
plt.title('Training and Testing Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

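10. Predicting Credit Risk with Logistic Regression

Credit risk prediction is one of the most important applications of data science in the financial industry. Almost all lending institutions use machine learning to predict credit risk. If you want to improve your skills as a data scientist and put machine learning to work, this GitHub data science project is a good choice.

Problem Statement

This project is another application of machine learning in finance. It aims to predict the credit risk of different customers from their financial records, income, debt size, and a few other factors.

Brief Overview of the Project and Dataset

In this project, you will work with a dataset containing details of customer loans. It includes many features, such as loan size, interest rate, borrower income, and debt-to-income ratio. Analyzed together, these features help you determine each customer's credit risk.

Dataset link: https://github.com/shohaha/ML-predicting-credit-risk/tree/main/resources

Step-by-Step Guide

After obtaining the data, the first step is to process it. The data needs cleaning to make sure it is suitable for analysis.
Explore the dataset to understand the different features and to spot anomalies and patterns. This may involve visualizing the data with histograms, scatter plots, or heatmaps.
Select the most relevant features. For example, focus on credit score, income, or payment history when estimating credit risk.
Split the dataset into training and test sets, then fit a logistic regression model to the training data using maximum likelihood estimation. The fitted model estimates the probability that a customer fails to repay.
Once the model is ready, evaluate it with metrics such as precision and recall.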
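The maximum likelihood fit mentioned in the steps above can be illustrated without any library. This is a minimal sketch that climbs the log-likelihood by gradient ascent; scikit-learn's LogisticRegression uses more refined solvers and adds regularization by default. The single standardized feature and the labels are made up, and fit_logistic_regression and default_probability are hypothetical helper names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_regression(X, y, learning_rate=0.1, epochs=2000):
    """Maximize the log-likelihood by batch gradient ascent."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            error = label - p  # gradient of the log-likelihood w.r.t. the linear score
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w + learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias += learning_rate * grad_b / n_samples
    return weights, bias

def default_probability(features, weights, bias):
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

# Hypothetical standardized debt-to-income values; label 1 means the loan went bad
X = [[-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5]]
y = [0, 0, 0, 1, 1, 1]

weights, bias = fit_logistic_regression(X, y)
print(default_probability([-1.2], weights, bias), default_probability([1.2], weights, bias))
```

The output of default_probability is the estimated chance of default, which you can threshold (for example at 0.5) to get a class label.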

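Here are some related GitHub repositories that can help you predict credit risk:

Predicting credit risk with machine learning: https://github.com/shohaha/ML-predicting-credit-risk
Predicting high-risk loans: https://github.com/laurenemilyto/predicting-credit-risk

Code snippet: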

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix

# Load data from CSV file
data = pd.read_csv('credit_data.csv')

# Clean data by removing missing values
data.dropna(inplace=True)

# Split data into features and labels
features = data[['loan_size', 'interest_rate', 'borrower_income', 'debt_to_income',
                 'num_of_accounts', 'derogatory_marks', 'total_debt']]
labels = data['loan_status']

# Split data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.2, random_state=42)

# Scale features to zero mean and unit variance, fitting the scaler on the training set only
scaler = StandardScaler()
train_features = scaler.fit_transform(train_features)
test_features = scaler.transform(test_features)

# Build a logistic regression model for credit risk prediction
model = LogisticRegression()

# Train the model on the training set
model.fit(train_features, train_labels)

# Predict labels for the testing set
predictions = model.predict(test_features)

# Evaluate the model's accuracy and confusion matrix
accuracy = accuracy_score(test_labels, predictions)
conf_matrix = confusion_matrix(test_labels, predictions)
print('Accuracy:', accuracy)
print('Confusion Matrix:', conf_matrix)


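Best Practices for Contributing to Data Science Projects on GitHub

If you are an aspiring data scientist, you need to work on GitHub data science projects and understand how the platform works. As a data scientist, you must know how to gather data, modify projects, implement changes, and collaborate with others. This section covers some best practices to follow when working on GitHub projects.

Communicating and Collaborating with Other Contributors

As projects grow, handling them alone becomes almost impossible. You need to work with others who are involved in similar projects or ideas. Collaboration also lets everyone draw on a wider range of skills and perspectives, which leads to better code, faster development, and stronger model performance.

Following Community Guidelines and Project Standards

GitHub is a globally known public code hosting platform used by many people in data science and machine learning. Following community guidelines and standards is the most reliable way to keep track of updates and stay consistent across the platform. These standards help keep code high quality and secure, and in line with industry best practices.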


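Writing Clean Code and Documenting Changes

Coding is an intuitive process, and there can be countless ways to code a single task or application. The preferred version, however, is the most readable and concise one, because it is easier to understand and maintain over time. This helps reduce errors and improve code quality.

In addition, documenting changes and contributions to existing code makes the process more credible and transparent for everyone. This helps build public trust on the platform.

Testing and Debugging Changes

Continuously testing and debugging code changes is an excellent way to ensure quality and consistency. It helps identify compatibility issues across systems, browsers, or platforms, so the project works as expected in different environments. Because problems are resolved early, it also lowers the long-term cost of maintaining the code.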

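How to Showcase Your Data Science Projects on GitHub

If you are wondering how to present your GitHub data science projects, this section is for you. Start by building a credible data analyst or data scientist portfolio on GitHub. Once you have a profile, follow these steps.

Create a new repository with a descriptive name and a short description.
Add a README file that outlines your GitHub data science project, the dataset, the methods, and anything else you want to share. This can include your contributions to the project, its impact on society, costs, and so on.
Add a folder with the source code. Make sure the code is clean and well documented.
If you want to make your repository public and are open to feedback or suggestions, include a license. GitHub offers several license options.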

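Conclusion

As someone interested in this field, you have surely seen that the world of data science keeps evolving. Whether through exploring new datasets or building more complex models, data science keeps adding value to everyday business operations, and that draws people to explore it as a career. For aspiring data scientists and working professionals alike, GitHub is the go-to platform for showcasing work and learning from others. That is why this post walked through 10 GitHub data science projects for beginners, each with different applications and challenges. By exploring them, you can gain a deeper understanding of the data science workflow, including data preparation, exploration, visualization, and modeling.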





    