LDA model

This page describes the artm.LDA class.
class artm.LDA(num_topics=None, num_processors=None, cache_theta=False, dictionary=None, num_document_passes=10, seed=-1, alpha=0.01, beta=0.01, theta_columns_naming='id')

__init__(num_topics=None, num_processors=None, cache_theta=False, dictionary=None, num_document_passes=10, seed=-1, alpha=0.01, beta=0.01, theta_columns_naming='id')

Parameters:
- num_topics (int) – the number of topics in the model; will be overwritten if topic_names is set
- num_processors (int) – the number of threads to use for model training; if not specified, the number of threads is detected by the library
- cache_theta (bool) – whether to save the Theta matrix in the model; necessary if usage of ARTM.get_theta() is expected
- num_document_passes (int) – the number of inner iterations over each document
- dictionary (str or reference to Dictionary object) – the dictionary to be used for initialization; if None, nothing is done
- reuse_theta (bool) – whether to reuse Theta from the previous iteration
- seed (unsigned int or -1) – seed for random initialization; -1 means no seed
- alpha (float) – hyperparameter of the Theta smoothing regularizer
- beta (float or list of floats with len == num_topics) – hyperparameter of the Phi smoothing regularizer
- theta_columns_naming (str) – either 'id' or 'title'; determines how to name the columns (documents) in the Theta dataframe

Note:
- The type (not the value!) of beta must not change after initialization: if it was a scalar, it must stay a scalar; if it was a list, it must stay a list.
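The beta type rule can be illustrated with a small, hypothetical helper (check_beta is not part of the artm API; it only encodes the constraint stated in the note):

```python
def check_beta(old_beta, new_beta, num_topics):
    """Return True if replacing old_beta with new_beta keeps its type."""
    if isinstance(old_beta, list):
        # beta was a list: it must stay a list of length num_topics
        return isinstance(new_beta, list) and len(new_beta) == num_topics
    # beta was a scalar: it must stay a scalar
    return not isinstance(new_beta, list)

# A scalar beta may be replaced by another scalar...
print(check_beta(0.01, 0.05, num_topics=10))         # True
# ...but not by a list (and a list may not become a scalar).
print(check_beta(0.01, [0.05] * 10, num_topics=10))  # False
```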
clone()

Description: returns a deep copy of the artm.LDA object.

Note:
- This method is equivalent to calling copy.deepcopy() on your artm.LDA object. For more information refer to the artm.ARTM.clone() method.
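Since clone() is equivalent to copy.deepcopy(), its semantics can be sketched with a toy stand-in object (ToyModel is made up for illustration; it is not the artm.LDA class):

```python
import copy

class ToyModel:
    def __init__(self, alpha, beta):
        self.alpha = alpha
        self.beta = beta  # may be a list, so a deep copy matters

model = ToyModel(alpha=0.01, beta=[0.01, 0.02])
clone = copy.deepcopy(model)

# The clone is an independent object: mutating it does not
# affect the original, even for nested mutable fields.
clone.beta[0] = 0.5
print(model.beta[0])  # 0.01
```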
fit_offline(batch_vectorizer, num_collection_passes=1)

Description: performs learning of the topic model in offline mode.

Parameters:
- batch_vectorizer (object_reference) – an instance of the BatchVectorizer class
- num_collection_passes (int) – the number of iterations over the whole given collection
fit_online(batch_vectorizer, tau0=1024.0, kappa=0.7, update_every=1)

Description: performs learning of the topic model in online mode.

Parameters:
- batch_vectorizer (object_reference) – an instance of the BatchVectorizer class
- update_every (int) – the number of batches; the model will be updated once per that many batches
- tau0 (float) – coefficient (see the 'Update formulas' paragraph)
- kappa (float) – power for tau0 (see the 'Update formulas' paragraph)
- update_after (list of int) – the numbers of batches to be passed before each Phi synchronization
Update formulas:
- The formulas for decay_weight and apply_weight:
  - update_count = current_processed_docs / (batch_size * update_every)
  - rho = pow(tau0 + update_count, -kappa)
  - decay_weight = 1 - rho
  - apply_weight = rho
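The update formulas above can be computed directly; a minimal sketch in plain Python (parameter names follow the formulas, not the artm internals):

```python
def online_weights(current_processed_docs, batch_size, update_every, tau0, kappa):
    # Number of Phi updates performed so far.
    update_count = current_processed_docs / (batch_size * update_every)
    # Learning-rate-like coefficient that decays as more updates happen.
    rho = pow(tau0 + update_count, -kappa)
    return 1.0 - rho, rho  # (decay_weight, apply_weight)

decay, apply_ = online_weights(
    current_processed_docs=2048, batch_size=1000,
    update_every=1, tau0=1024.0, kappa=0.7)
print(f"{decay + apply_:.6f}")  # 1.000000 -- the two weights sum to one
```

Larger tau0 slows early updates down; larger kappa makes the decay of apply_weight faster.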
get_theta()

Description: gets the Theta matrix for the training set of documents.

Returns:
- pandas.DataFrame (data, columns, rows), where:
  - columns – the ids of documents for which the Theta matrix was requested;
  - rows – the names of topics in the topic model that was used to create Theta;
  - data – the content of the Theta matrix.
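The returned layout can be sketched with a toy pandas DataFrame (all topic names, document ids, and probabilities here are made up):

```python
import pandas as pd

# Rows are topic names, columns are document ids, cells are p(topic | doc).
theta = pd.DataFrame(
    [[0.9, 0.2, 0.5],
     [0.1, 0.8, 0.5]],
    index=['topic_0', 'topic_1'],  # rows: topics
    columns=[0, 1, 2],             # columns: document ids
)
# Each column is a probability distribution over topics for one document.
print(theta[0].sum())  # 1.0
```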
get_top_tokens(num_tokens=10, with_weights=False)

Description: returns the most probable tokens for each topic.

Parameters:
- num_tokens (int) – the number of top tokens to be returned
- with_weights (bool) – whether to return bare tokens or (token, p_wt) tuples

Returns:
- if with_weights == False: a list of lists of str, where each inner list corresponds to one topic, in natural order;
- if with_weights == True: a list of lists of (str, float) tuples.
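What get_top_tokens computes can be sketched in plain Python over a toy Phi matrix (the top_tokens helper, topic names, tokens, and probabilities are all made up for illustration):

```python
# Toy p_wt: one {token: probability} dict per topic.
phi = {
    'topic_0': {'cat': 0.5, 'dog': 0.3, 'fish': 0.2},
    'topic_1': {'bond': 0.6, 'stock': 0.3, 'cat': 0.1},
}

def top_tokens(phi, num_tokens=2, with_weights=False):
    result = []
    for topic in phi:  # topics in natural (insertion) order
        # Rank the topic's tokens by probability, highest first.
        ranked = sorted(phi[topic].items(), key=lambda kv: kv[1], reverse=True)
        top = ranked[:num_tokens]
        result.append(top if with_weights else [token for token, _ in top])
    return result

print(top_tokens(phi))  # [['cat', 'dog'], ['bond', 'stock']]
```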
initialize(dictionary)

Description: initializes the topic model before learning.

Parameters:
- dictionary (str or reference to Dictionary object) – a loaded BigARTM collection dictionary
load(filename, model_name='p_wt')

Description: loads from disk a topic model saved by LDA.save().

Parameters:
- filename (str) – the name of the file containing the model
- model_name (str) – the name of the matrix to be loaded, 'p_wt' or 'n_wt'

Note:
- We strongly recommend resetting all important parameters of the LDA model that were used earlier.
remove_theta()

Description: removes the cached Theta matrix.
save(filename, model_name='p_wt')

Description: saves one Phi-like matrix to disk.

Parameters:
- filename (str) – the name of the file to store the model in
- model_name (str) – the name of the matrix to be saved, 'p_wt' or 'n_wt'
transform(batch_vectorizer, theta_matrix_type='dense_theta')

Description: finds the Theta matrix for new documents.

Parameters:
- batch_vectorizer (object_reference) – an instance of the BatchVectorizer class
- theta_matrix_type (str) – the type of matrix to be returned; possible values: 'dense_theta', None; default='dense_theta'

Returns:
- pandas.DataFrame (data, columns, rows), where:
  - columns – the ids of documents for which the Theta matrix was requested;
  - rows – the names of topics in the topic model that was used to create Theta;
  - data – the content of the Theta matrix.