Machine learning ⊂ Artificial intelligence
It lets a machine make decisions based on a prior learning process, driven by empirical data.
from IPython.display import Image, IFrame
Image(filename='img/ml_map.png')
IFrame(src='http://www.grandlyon.com/Fete-des-lumieres.4977.0.html', width=1000, height=800)
Image('img/clustering2.png')
Assign a class / a label to an item.
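As a hedged toy sketch (not from the original slides; the data and labels below are made up), a classifier learns from labelled examples and then assigns a label to a new item. The real text-classification pipeline follows in the next cells.
from sklearn.neighbors import KNeighborsClassifier

# made-up feature vectors [weight_g, size_cm] and labels, purely illustrative
X = [[150, 7], [170, 8], [140, 6], [1200, 25], [1100, 23], [1300, 26]]
y = ['apple', 'apple', 'apple', 'melon', 'melon', 'melon']
toy_clf = KNeighborsClassifier(n_neighbors=3)
toy_clf.fit(X, y)            # learn from labelled examples
toy_clf.predict([[160, 7]])  # -> ['apple']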
import codecs
import requests

articles = requests.get(
    'http://balthazar-rouberol.com/slides/mlintro/data/classified-articles.json').json()
stopwords = [word.strip() for word in codecs.open('data/stopwords.txt', 'r', 'utf-8')]
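The structure of the downloaded JSON is only inferred from the cells below: a 'train' mapping from rubric name to a list of article texts. This inspection cell is an added sanity check, not part of the original pipeline.
# assumed shape: {'train': {'sport': [<article text>, ...], 'politics': [...]}, ...}
print(list(articles.keys()))
print(len(articles['train']['sport']), len(articles['train']['politics']))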
from sklearn.feature_extraction.text import TfidfVectorizer

text_vectorizer = TfidfVectorizer(
    max_df=4000,              # ignore terms appearing in more than 4000 documents
    min_df=6,                 # ignore terms appearing in fewer than 6 documents
    max_features=500,         # maximum number of features
    strip_accents='unicode',  # replace all accented unicode chars
                              # by their corresponding ASCII char
    stop_words=stopwords,
    analyzer='word',          # features made of words
    token_pattern=r'\w{4,}',  # tokenize only words of 4+ chars
    ngram_range=(1, 1),       # features made of a single token
    use_idf=True,             # enable inverse-document-frequency reweighting
    smooth_idf=True,          # prevents zero division for unseen words
    sublinear_tf=False)
# vectorize all training articles
train = articles['train']['sport'] + articles['train']['politics']
text_vector = text_vectorizer.fit_transform(train)
# training rubric array
rubrics = ['sport'] * len(articles['train']['sport']) + \
    ['politics'] * len(articles['train']['politics'])
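A quick added sanity check (assuming the cells above were run): each article becomes one sparse TF-IDF row with at most 500 features.
print(text_vector.shape)                 # (number of training articles, number of features)
print(len(text_vectorizer.vocabulary_))  # number of retained tokens (<= max_features)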
from sklearn.svm import LinearSVC

clf = LinearSVC()
clf.fit(text_vector, rubrics)
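To get a rough idea of how well the classifier generalizes, a hedged evaluation sketch using scikit-learn cross-validation on the training vectors (an addition, not part of the original slides):
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of a fresh LinearSVC on the TF-IDF vectors
scores = cross_val_score(LinearSVC(), text_vector, rubrics, cv=5)
print(scores.mean(), scores.std())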
def predict_rubric(article):
    """Vectorize the article using the training text vectorizer
    and predict the article rubric.
    """
    article_vector = text_vectorizer.transform([article])
    return clf.predict(article_vector)
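Usage sketch; the article text is made up for illustration:
predict_rubric("Le PSG a remporté le match hier soir au Parc des Princes.")  # expected: ['sport']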
- NLTK (NLP only)
- nlpy (classification, regression, clustering)
- PyBrain (neural networks)
- bigml (REST API)
- scikit-learn (classification, regression, clustering, feature selection, visualization, etc.)

scikit-learn (+/- numpy & scipy)