To label a large paired training dataset, the following approaches can be used:
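# Manual labeling: every pair already carries a hand-assigned label.
pairs = [
    ("I love dogs", "I hate cats", True),
    ("She is a doctor", "He is an engineer", False),
    # ...
]
for pair in pairs:
    sentence1, sentence2, label = pair
    # Perform the data-labeling operation here (placeholder)
    pass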
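# Automatic labeling: threshold spaCy's vector similarity between the two sentences.
import spacy

# Requires the model to be installed: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

pairs = [
    ("I love dogs", "I hate cats", None),
    ("She is a doctor", "He is an engineer", None),
    # ...
]

labeled_pairs = []
for sentence1, sentence2, _ in pairs:
    doc1 = nlp(sentence1)
    doc2 = nlp(sentence2)
    similarity = doc1.similarity(doc2)
    # Tuples are immutable, so build a new tuple instead of assigning to pair[2].
    labeled_pairs.append((sentence1, sentence2, similarity > 0.8))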
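The threshold rule can also be tried without downloading a spaCy model. The following is only a minimal sketch: it swaps spaCy's vector similarity for a plain Jaccard token-overlap score, which is a much cruder measure. The helper names `jaccard_similarity` and `label_pairs` and the sample sentences are hypothetical and used purely for illustration.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a crude stand-in for doc1.similarity(doc2)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def label_pairs(pairs, threshold=0.8):
    """Return new (sentence1, sentence2, label) tuples; the input list is not modified."""
    return [
        (s1, s2, jaccard_similarity(s1, s2) > threshold)
        for s1, s2, _ in pairs
    ]


# Hypothetical sample data, not taken from a real dataset.
unlabeled = [
    ("I love dogs", "I love dogs", None),
    ("She is a doctor", "He is an engineer", None),
]
labeled = label_pairs(unlabeled)
```

Because the function returns new tuples, the original unlabeled list stays available for re-labeling with a different threshold.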
# Model-based labeling: train a classifier on pairs that are already labeled.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

# The training data must contain examples of both labels.
pairs = [
    ("I love dogs", "I hate cats", True),
    ("She is a doctor", "He is an engineer", False),
    # ...
]

# Join each pair into a single string and collect the labels.
X = [pair[0] + " " + pair[1] for pair in pairs]
y = [pair[2] for pair in pairs]

# Convert the text into bag-of-words count features.
vectorizer = CountVectorizer()
X_vectorized = vectorizer.fit_transform(X)

# Hold out 20% of the pairs for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_vectorized, y, test_size=0.2, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
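To judge whether the model's labels are accurate enough, the predictions can be compared with the held-out labels. The sketch below uses a hypothetical pure-Python helper `accuracy` and made-up example lists. With the snippet above, `y_test` and `y_pred` would be passed in instead, and `sklearn.metrics.accuracy_score` computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels for illustration only.
y_test_example = [True, False, True, True]
y_pred_example = [True, False, False, True]
score = accuracy(y_test_example, y_pred_example)
```

If the score falls below what the application needs, the automatically produced labels may need manual review before they are used for training.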
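These are several common approaches; which one to choose depends on the size of the dataset, the available resources, and the accuracy required.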