Matrix factorization and prediction for high dimensional zero inflated co-occurrence count data via shared parameter alternating generalized linear regression with application in NLP
Abstract
In this dissertation, we study methods for analyzing co-occurrence count data arising in practical settings, such as user-item data from online shopping platforms and co-occurring word-word pairs in sequences of text. Such data contain important information for developing recommender systems or for studying the relevance of items or words from non-numerical sources. No covariates are observed, and the co-occurrence matrix is typically of such high dimension that it does not fit into a computer's memory for modeling. We extract numerical data by defining co-occurrence windows and using weighted counts on a continuous scale. Positive probability mass is allowed for zero observations. We propose several likelihood-based mixture models with shared parameters to accommodate the large number of zero observations. Shared parameters also make estimation more difficult, because different parts of the model must be considered jointly. We summarize the relevance of user-item or word-word pairs by the cosine similarity of their unknown dense vector representations. The unknown values are estimated via shared parameter alternating regression models. Under this framework, we present a detailed study of Shared parameter Alternating Zero Inflated Gamma (SA-ZIG) regression and Shared parameter Alternating Tweedie (SA-Tweedie) regression, along with some applications in the Natural Language Processing (NLP) domain. For SA-ZIG regression, models with the canonical link and with the log link were considered. Two parameter updating schemes are proposed, along with an algorithm to estimate the unknown parameters. An analytical convergence analysis is presented. Numerical studies showed that SA-ZIG with learning rate adjustment performs satisfactorily, whereas SA-ZIG using Fisher scoring without learning rate adjustment may fail to find the maximum likelihood estimate.
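To make the data construction concrete, the following is a minimal sketch of windowed, weighted co-occurrence counting and of cosine similarity between two vectors. The 1/distance weighting, the symmetric counting, and the names `weighted_cooccurrence` and `cosine_similarity` are assumptions made for illustration; the abstract does not specify the weighting scheme used in the dissertation.

```python
from collections import defaultdict
import math


def weighted_cooccurrence(tokens, window=2):
    """Return sparse weighted co-occurrence counts as {(word_a, word_b): weight}.

    Pairs that never co-occur inside the window are simply absent, which is
    how the many zero observations are represented without a dense matrix.
    """
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i + distance
            if j >= len(tokens):
                break
            # Assumed weighting: closer pairs contribute more (1/distance).
            weight = 1.0 / distance
            # Assumed symmetric counting: both orderings receive the weight.
            counts[(center, tokens[j])] += weight
            counts[(tokens[j], center)] += weight
    return dict(counts)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


tokens = "the cat sat on the mat".split()
counts = weighted_cooccurrence(tokens, window=2)
```

In this toy sentence the counts are real-valued rather than integer, and most word pairs have no entry at all, which is the kind of zero-heavy continuous data the mixture models above are meant to accommodate.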
For SA-Tweedie regression, multiple versions of an algorithm with different objective functions are proposed to estimate the parameters. Models both with and without constraints on the parameters were studied. Updating formulae are given to obtain parameter estimates iteratively under each setting. A learning rate adjustment was used together with the Fisher scoring method to keep the algorithm moving in an improving direction. Numerical studies showed that our algorithm with Fisher scoring and learning rate adjustment outperforms both the version without learning rate adjustment and gradient descent with Adam updates. A pseudo-likelihood approach with alternating parameter updates was also studied but was found to be unsuitable for our shared parameter alternating regression models with unobserved covariates. The proposed algorithms were applied to Wikipedia dump data to obtain token/word vector representations, which were then used in NLP tasks such as finding the most similar words, sentiment analysis, and named entity recognition. Detailed analyses and comparisons are presented.
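The idea of pairing Fisher scoring with a learning rate adjustment can be illustrated on a much simpler problem than the one studied here. The sketch below fits an ordinary gamma regression with a log link and one observed covariate, and halves the step size whenever a full Fisher scoring step would lower the log-likelihood. Step halving is an assumed form of learning rate adjustment, and the function names are hypothetical; this is not the SA-ZIG or SA-Tweedie algorithm, which involve unobserved covariates and shared parameters.

```python
import math


def gamma_objective(beta, xs, ys):
    """Gamma log-likelihood in beta (log link), dropping terms and the
    positive scale factor that do not depend on beta."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = beta[0] + beta[1] * x
        total -= y * math.exp(-eta) + eta
    return total


def fisher_scoring_direction(beta, xs, ys):
    """Fisher scoring direction (X^T X)^{-1} X^T ((y - mu) / mu).

    For the gamma family with log link the expected information is
    proportional to X^T X, so the dispersion cancels in the update.
    """
    s0 = s1 = a00 = a01 = a11 = 0.0
    for x, y in zip(xs, ys):
        mu = math.exp(beta[0] + beta[1] * x)
        r = (y - mu) / mu
        s0 += r
        s1 += r * x
        a00 += 1.0
        a01 += x
        a11 += x * x
    det = a00 * a11 - a01 * a01  # closed-form 2x2 solve
    return [(a11 * s0 - a01 * s1) / det, (a00 * s1 - a01 * s0) / det]


def fit(xs, ys, max_iter=100, tol=1e-10):
    """Fisher scoring with step halving (assumed learning rate adjustment)."""
    beta = [math.log(sum(ys) / len(ys)), 0.0]
    obj = gamma_objective(beta, xs, ys)
    for _ in range(max_iter):
        step = fisher_scoring_direction(beta, xs, ys)
        rate = 1.0
        while True:
            cand = [beta[0] + rate * step[0], beta[1] + rate * step[1]]
            cand_obj = gamma_objective(cand, xs, ys)
            # Accept the step only if the objective does not decrease.
            if cand_obj >= obj or rate < 1e-8:
                break
            rate /= 2.0
        if cand_obj - obj < tol:
            if cand_obj >= obj:
                beta = cand
            break
        beta, obj = cand, cand_obj
    return beta


# Noise-free toy data generated from intercept 0.5 and slope 0.3.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [math.exp(0.5 + 0.3 * x) for x in xs]
beta_hat = fit(xs, ys)
```

Because the step is accepted only when the objective does not decrease, the iteration cannot move to a worse point, which is one plausible reading of why an adjusted learning rate helps where plain Fisher scoring may fail.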