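You can use LabelEncoder from sklearn to label-encode the categorical data, SimpleImputer from the sklearn.impute module to fill in the missing values, and finally LabelEncoder again to decode the data back to its original form.
Example code: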
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
# Create example dataset: column 0 is categorical, column 1 is numeric
X = [['male', 45], ['female', 35], ['female', 24], [None, 28], ['male', None]]
# 1. Label encode the categorical column only; represent missing values as NaN
le = LabelEncoder()
le.fit([row[0] for row in X if row[0] is not None])
X_encoded = np.array(
    [[le.transform([row[0]])[0] if row[0] is not None else np.nan,
      row[1] if row[1] is not None else np.nan] for row in X],
    dtype=float,
)
# 2. Impute missing data with the most frequent value of each column
imp = SimpleImputer(strategy='most_frequent')
X_imputed = imp.fit_transform(X_encoded)
# 3. Inverse encoding: map the integer codes back to the original labels
genders = le.inverse_transform(X_imputed[:, 0].astype(int)).tolist()
X_inverse = [[g, int(age)] for g, age in zip(genders, X_imputed[:, 1])]
print(X_inverse)
Output:
[['male', 45], ['female', 35], ['female', 24], ['female', 28], ['male', 24]]
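To see why the missing gender becomes 'female' and the missing age becomes 24, it may help to spell out the most_frequent rule in plain Python. The sketch below is only an illustration, not sklearn's implementation: impute_most_frequent is a hypothetical helper, and it assumes missing values are marked with None. It fills each column with its most frequent value and breaks ties by taking the smallest one, which is the tie-breaking rule documented for SimpleImputer(strategy='most_frequent').

```python
from collections import Counter

def impute_most_frequent(column):
    """Fill None entries with the column's most frequent value.

    Hypothetical helper for illustration only. Ties are broken by taking
    the smallest value, mirroring the documented behaviour of
    SimpleImputer(strategy='most_frequent').
    """
    counts = Counter(v for v in column if v is not None)
    top = max(counts.values())
    fill = min(v for v, c in counts.items() if c == top)
    return [fill if v is None else v for v in column]

X = [['male', 45], ['female', 35], ['female', 24], [None, 28], ['male', None]]

# Work column by column, then transpose back to rows.
columns = [impute_most_frequent(col) for col in zip(*X)]
X_filled = [list(row) for row in zip(*columns)]
print(X_filled)
```

In this dataset 'male' and 'female' each appear twice, so the tie goes to 'female' (the smaller string), and every age appears once, so the smallest age, 24, is used. With real data such ties are usually rare, but for a numeric column like age a strategy such as 'median' may be a more sensible choice.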