LDA model

This page describes the artm.LDA class.
class artm.LDA(num_topics=None, num_processors=None, cache_theta=False, dictionary=None, num_document_passes=10, seed=-1, alpha=0.01, beta=0.01, theta_columns_naming='id')

__init__(num_topics=None, num_processors=None, cache_theta=False, dictionary=None, num_document_passes=10, seed=-1, alpha=0.01, beta=0.01, theta_columns_naming='id')

Parameters:
- num_topics (int) – the number of topics in the model; will be overwritten if topic_names is set
- num_processors (int) – the number of threads to use for model training; if not specified, the number of threads is detected by the library
- cache_theta (bool) – whether to save the Theta matrix in the model; necessary if usage of ARTM.get_theta() is expected
- num_document_passes (int) – the number of inner iterations over each document
- dictionary (str or reference to Dictionary object) – the dictionary to be used for initialization; if None, nothing is done
- reuse_theta (bool) – whether to reuse Theta from the previous iteration
- seed (unsigned int or -1) – seed for random initialization; -1 means no seed
- alpha (float) – hyperparameter of the Theta smoothing regularizer
- beta (float or list of floats with len == num_topics) – hyperparameter of the Phi smoothing regularizer
- theta_columns_naming (str) – either 'id' or 'title'; determines how to name the columns (documents) in the Theta dataframe

Note:
- The type (not the value!) of beta must not change after initialization: if it was a scalar, it must stay a scalar; if it was a list, it must stay a list.
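The beta type rule can be illustrated with a small, hypothetical helper (check_beta is not part of the artm API; it only encodes the constraint stated in the note):

```python
def check_beta(old_beta, new_beta, num_topics):
    """Return True if replacing old_beta with new_beta keeps its type."""
    if isinstance(old_beta, list):
        # beta was a list: it must stay a list of length num_topics
        return isinstance(new_beta, list) and len(new_beta) == num_topics
    # beta was a scalar: it must stay a scalar
    return not isinstance(new_beta, list)

# A scalar beta may be replaced by another scalar...
print(check_beta(0.01, 0.05, num_topics=10))         # True
# ...but not by a list (and a list may not become a scalar).
print(check_beta(0.01, [0.05] * 10, num_topics=10))  # False
```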
clone()

Description: returns a deep copy of the artm.LDA object.

Note:
- This method is equivalent to calling copy.deepcopy() on your artm.LDA object. For more information refer to the artm.ARTM.clone() method.
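Since clone() is equivalent to copy.deepcopy(), its semantics can be sketched with a toy stand-in object (ToyModel is made up for illustration; it is not the artm.LDA class):

```python
import copy

class ToyModel:
    def __init__(self, alpha, beta):
        self.alpha = alpha
        self.beta = beta  # may be a list, so a deep copy matters

model = ToyModel(alpha=0.01, beta=[0.01, 0.02])
clone = copy.deepcopy(model)

# The clone is an independent object: mutating it does not
# affect the original, even for nested mutable fields.
clone.beta[0] = 0.5
print(model.beta[0])  # 0.01
```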
fit_offline(batch_vectorizer, num_collection_passes=1)

Description: performs learning of the topic model in offline mode.

Parameters:
- batch_vectorizer (object_reference) – an instance of the BatchVectorizer class
- num_collection_passes (int) – the number of iterations over the whole given collection
fit_online(batch_vectorizer, tau0=1024.0, kappa=0.7, update_every=1)

Description: performs learning of the topic model in online mode.

Parameters:
- batch_vectorizer (object_reference) – an instance of the BatchVectorizer class
- update_every (int) – the number of batches; the model will be updated once per that many batches
- tau0 (float) – coefficient (see the 'Update formulas' paragraph)
- kappa (float) – power for tau0 (see the 'Update formulas' paragraph)
- update_after (list of int) – the numbers of batches to be passed before each Phi synchronization
Update formulas:
- The formulas for decay_weight and apply_weight:
  - update_count = current_processed_docs / (batch_size * update_every)
  - rho = pow(tau0 + update_count, -kappa)
  - decay_weight = 1 - rho
  - apply_weight = rho
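The update formulas above can be computed directly; a minimal sketch in plain Python (parameter names follow the formulas, not the artm internals):

```python
def online_weights(current_processed_docs, batch_size, update_every, tau0, kappa):
    # Number of Phi updates performed so far.
    update_count = current_processed_docs / (batch_size * update_every)
    # Learning-rate-like coefficient that decays as more updates happen.
    rho = pow(tau0 + update_count, -kappa)
    return 1.0 - rho, rho  # (decay_weight, apply_weight)

decay, apply_ = online_weights(
    current_processed_docs=2048, batch_size=1000,
    update_every=1, tau0=1024.0, kappa=0.7)
print(f"{decay + apply_:.6f}")  # 1.000000 -- the two weights sum to one
```

Larger tau0 slows early updates down; larger kappa makes the decay of apply_weight faster.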
get_theta()

Description: gets the Theta matrix for the training set of documents.

Returns:
- pandas.DataFrame (data, columns, rows), where:
  - columns – the ids of documents for which the Theta matrix was requested;
  - rows – the names of topics in the topic model that was used to create Theta;
  - data – the content of the Theta matrix.
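The returned layout can be sketched with a toy pandas DataFrame (all topic names, document ids, and probabilities here are made up):

```python
import pandas as pd

# Rows are topic names, columns are document ids, cells are p(topic | doc).
theta = pd.DataFrame(
    [[0.9, 0.2, 0.5],
     [0.1, 0.8, 0.5]],
    index=['topic_0', 'topic_1'],  # rows: topics
    columns=[0, 1, 2],             # columns: document ids
)
# Each column is a probability distribution over topics for one document.
print(theta[0].sum())  # 1.0
```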
get_top_tokens(num_tokens=10, with_weights=False)

Description: returns the most probable tokens for each topic.

Parameters:
- num_tokens (int) – the number of top tokens to be returned
- with_weights (bool) – whether to return bare tokens or (token, p_wt) tuples

Returns:
- if with_weights == False: a list of lists of str, where each inner list corresponds to one topic, in natural order;
- if with_weights == True: a list of lists of (str, float) tuples.
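What get_top_tokens computes can be sketched in plain Python over a toy Phi matrix (the top_tokens helper, topic names, tokens, and probabilities are all made up for illustration):

```python
# Toy p_wt: one {token: probability} dict per topic.
phi = {
    'topic_0': {'cat': 0.5, 'dog': 0.3, 'fish': 0.2},
    'topic_1': {'bond': 0.6, 'stock': 0.3, 'cat': 0.1},
}

def top_tokens(phi, num_tokens=2, with_weights=False):
    result = []
    for topic in phi:  # topics in natural (insertion) order
        # Rank the topic's tokens by probability, highest first.
        ranked = sorted(phi[topic].items(), key=lambda kv: kv[1], reverse=True)
        top = ranked[:num_tokens]
        result.append(top if with_weights else [token for token, _ in top])
    return result

print(top_tokens(phi))  # [['cat', 'dog'], ['bond', 'stock']]
```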
initialize(dictionary)

Description: initializes the topic model before learning.

Parameters:
- dictionary (str or reference to Dictionary object) – a loaded BigARTM collection dictionary
load(filename, model_name='p_wt')

Description: loads from disk a topic model saved by LDA.save().

Parameters:
- filename (str) – the name of the file containing the model
- model_name (str) – the name of the matrix to be loaded, 'p_wt' or 'n_wt'

Note:
- We strongly recommend resetting all important parameters of the LDA model that were used earlier.
remove_theta()

Description: removes the cached Theta matrix.
save(filename, model_name='p_wt')

Description: saves one Phi-like matrix to disk.

Parameters:
- filename (str) – the name of the file to store the model in
- model_name (str) – the name of the matrix to be saved, 'p_wt' or 'n_wt'
transform(batch_vectorizer, theta_matrix_type='dense_theta')

Description: finds the Theta matrix for new documents.

Parameters:
- batch_vectorizer (object_reference) – an instance of the BatchVectorizer class
- theta_matrix_type (str) – the type of matrix to be returned; possible values: 'dense_theta', None; default='dense_theta'

Returns:
- pandas.DataFrame (data, columns, rows), where:
  - columns – the ids of documents for which the Theta matrix was requested;
  - rows – the names of topics in the topic model that was used to create Theta;
  - data – the content of the Theta matrix.