coherence: Coherence metrics for topic models

Description

Given a topic model with topics represented as ordered term lists, the coherence may be used to assess the quality of individual topics. This function is an implementation of several of the numerous possible metrics for such kinds of assessment. Coherence calculation is sensitive to the content of the reference tcm that is used for evaluation and that may be created with different parameter settings. Please refer to the details section (or the reference section) for information on typical combinations of metric and type of tcm. For more general information on measuring coherence, a starting point is given in the reference section.

Usage

coherence(x, tcm, metrics = c("mean_logratio", "mean_pmi", "mean_npmi",
  "mean_difference", "mean_npmi_cosim", "mean_npmi_cosim2"),
  smooth = 1e-12, n_doc_tcm = -1)

Arguments

x	A character matrix with the top terms per topic (each column represents one topic). Terms of x have to be ranked per topic, starting with rank 1 in row 1.

tcm	The term co-occurrence matrix, e.g., a Matrix::sparseMatrix or base::matrix, serving as the reference to calculate coherence metrics. Please note that a memory-efficient version of the tcm is assumed as input, with all entries in the lower triangle (excluding the diagonal) set to zero (see, e.g., create_tcm). Please also note that some efforts during pre-processing steps might be skipped, since the tcm is internally reduced to the top word space, i.e., all unique terms of x.

metrics	Character vector specifying the metrics to be calculated. Currently the following metrics are implemented: c("mean_logratio", "mean_pmi", "mean_npmi", "mean_difference", "mean_npmi_cosim", "mean_npmi_cosim2"). Please refer to the details section for more information on the metrics.

smooth	Numeric smoothing constant to avoid taking the logarithm of zero.

n_doc_tcm	The integer number of documents or text windows that was used to create the tcm. n_doc_tcm is used to calculate term probabilities from term counts, as required for several metrics.

Details

The currently implemented coherence metrics are described below, including a description of the content type of the tcm that showed good performance in combination with a specific metric. For details on how to create the tcm, see the example section. For details on the performance of the metrics, see the resources in the reference section, which served for the definition of the standard settings for the individual metrics. Note that, depending on the use case, settings other than the standard settings for creation of the tcm may still be reasonable. Note that for all currently implemented metrics the tcm is reduced to the top word space on the basis of the terms in x. Considering the use case of finding the optimum number of topics among several models, calculating the mean score over all topics and normalizing these mean coherence scores from the different metrics may be considered for a direct comparison across models.
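As a sketch of the expected input shape of x (the term names and topic labels here are invented for illustration), each column holds the top terms of one topic, ranked from row 1 downwards:

```r
# Hypothetical top-terms matrix for two topics; matrix() fills column-wise,
# so each column is one topic with rank 1 in row 1.
x <- matrix(c("price", "market", "stock",     # topic 1, ranks 1..3
              "player", "team", "goal"),      # topic 2, ranks 1..3
            nrow = 3,
            dimnames = list(NULL, c("topic_economy", "topic_sports")))

# coherence() would internally reduce a full tcm to this top word space
top_word_space <- unique(as.vector(x))        # the 6 unique terms of x
```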
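The role of tcm, smooth, and n_doc_tcm in the PMI-based metrics can be illustrated with a hand computation. The following base-R sketch derives a mean PMI and a mean NPMI score for one topic from an upper-triangular tcm of document co-occurrence counts; the toy counts, the choice of log base 2, and all variable names are assumptions for illustration, not the package's exact implementation:

```r
# Illustrative sketch, NOT text2vec's internal code: mean PMI / mean NPMI
# for the top terms of one topic, from an upper-triangular tcm whose
# diagonal holds document frequencies.
n_doc  <- 100                     # documents used to build the tcm (n_doc_tcm)
smooth <- 1e-12                   # smoothing constant, avoids log(0)

terms <- c("a", "b", "c")         # top terms of the topic (invented)
tcm <- matrix(0, 3, 3, dimnames = list(terms, terms))
diag(tcm) <- c(50, 40, 30)        # df(a), df(b), df(c)
tcm["a", "b"] <- 20               # documents containing both a and b
tcm["a", "c"] <- 10
tcm["b", "c"] <- 5

p_single <- diag(tcm) / n_doc     # P(w), term probabilities from counts
pairs <- t(combn(terms, 2))       # all unordered term pairs of the topic

pmi <- apply(pairs, 1, function(p) {
  p_joint <- tcm[p[1], p[2]] / n_doc
  log2((p_joint + smooth) / (p_single[p[1]] * p_single[p[2]]))
})
npmi <- apply(pairs, 1, function(p) {
  p_joint <- tcm[p[1], p[2]] / n_doc
  pmi_ij <- log2((p_joint + smooth) / (p_single[p[1]] * p_single[p[2]]))
  pmi_ij / -log2(p_joint + smooth)  # NPMI is bounded in [-1, 1]
})

mean_pmi  <- mean(pmi)            # approx. -0.616 for the toy counts above
mean_npmi <- mean(npmi)           # approx. -0.156
```

Since only the upper triangle (plus the diagonal) of the tcm is populated, the sketch mirrors the memory-efficient input format that the function assumes.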
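The model-selection use case mentioned at the end of the details section can be sketched as follows. The mean scores below are invented, and min-max rescaling is just one possible way to normalize metrics that live on different scales before averaging them:

```r
# Hypothetical comparison of models with 10, 20 and 30 topics.
# Rows: candidate models; columns: mean coherence per metric (invented values).
mean_scores <- rbind(
  k10 = c(mean_pmi = -1.20, mean_npmi = -0.10),
  k20 = c(mean_pmi = -0.80, mean_npmi = -0.05),
  k30 = c(mean_pmi = -1.50, mean_npmi = -0.20)
)

# Rescale each metric to [0, 1] across the models so the metrics
# become directly comparable.
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
normalized <- apply(mean_scores, 2, rescale01)

# Rank the models by their average normalized score over all metrics.
combined <- rowMeans(normalized)
best <- names(which.max(combined))   # "k20" for the numbers above
```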