Control Engineering of China ›› 2019, Vol. 26 ›› Issue (2): 314-319.

An Active Learning Algorithm Based on Imbalanced Datasets

  

  • Online: 2019-02-20 Published: 2023-10-26

基于不平衡数据集的主动学习分类算法

  

Abstract: To address the noise data and low classification accuracy that arise when classifying imbalanced datasets, an active learning SVM classification algorithm based on an improved SMOTE is proposed. On the training sample set, the algorithm uses the attribution values of the minority class samples and a majority vote method to control the number of synthetic minority class samples. The classification hyperplane is determined using a distance formula as the criterion, and an equal number of the majority class samples closest to the hyperplane are selected to form a balanced sample dataset. A support vector machine (SVM) is trained on this balanced set to obtain an optimized classifier. Active learning is then applied to the imbalanced dataset with the training samples removed: the optimized classifier classifies the remaining samples iteratively until none remain. Experiments on UCI data show that the proposed algorithm effectively reduces the influence of noise on classification and improves classification accuracy on imbalanced datasets.
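
Two of the steps named in the abstract, synthesizing minority class samples and picking the majority class samples nearest the hyperplane, can be illustrated with a minimal sketch. It rests on several assumptions: it uses classical SMOTE interpolation rather than the paper's improved variant (the attribution values and majority vote are not specified in the abstract), it takes the point-to-hyperplane distance |w·x + b| / ||w|| as the distance formula, and it ranks all majority samples together instead of choosing them per side of the hyperplane. The function names `smote_oversample` and `closest_to_hyperplane`, the parameters `k` and `seed`, and the example hyperplane are hypothetical.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Classical SMOTE sketch: interpolate between a minority sample and
    one of its k nearest minority neighbours. Needs >= 2 minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


def closest_to_hyperplane(samples, w, b, n_keep):
    """Return the n_keep samples with the smallest |w.x + b| / ||w||."""
    norm = math.sqrt(sum(wi * wi for wi in w))

    def distance(x):
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

    return sorted(samples, key=distance)[:n_keep]


# Toy data (assumed for illustration only).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(3.0, 3.0), (5.0, 5.0), (2.0, 2.5), (6.0, 1.0), (4.0, 4.0), (2.5, 2.0)]

balanced_minority = minority + smote_oversample(minority, n_new=2)
# Hypothetical hyperplane x + y - 3 = 0; keep as many majority samples
# as there are (real + synthetic) minority samples.
balanced_majority = closest_to_hyperplane(
    majority, w=(1.0, 1.0), b=-3.0, n_keep=len(balanced_minority)
)
```

In the full procedure described above, an SVM would then be trained on `balanced_minority` and `balanced_majority`, and the resulting classifier would be applied repeatedly to the samples left outside the training set.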

Key words: data mining, imbalanced datasets, classification, SMOTE

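Abstract (Chinese version, translated): To address the noise data and low classification accuracy that imbalanced datasets tend to produce during classification, an active learning SVM classification algorithm for imbalanced datasets based on an improved SMOTE is proposed. On the training sample set, the algorithm uses the attribution values of the minority class samples and majority vote selection to control the number of synthetic minority class samples. With a distance formula as the criterion, it determines the hyperplane and selects equal numbers of symmetric, nearest majority class samples on both sides of the classification hyperplane to form a balanced sample dataset. A support vector machine (SVM) classifies this set to yield an optimized classifier, and active learning then applies the optimized classifier in a classification loop over the imbalanced dataset with the training samples removed, until no samples remain. Experiments on data from the UCI datasets show that the algorithm effectively reduces the influence of noise data on classification and effectively improves the classification accuracy of imbalanced datasets.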

Key words (Chinese version, translated): data mining, imbalanced datasets, classification, synthetic minority oversampling technique