Can Graph Neural Networks Go “Online”? An Analysis of Pretraining and Inference Galke, Lukas, Vagliano, Iacopo, and Scherp, Ansgar In Representation Learning on Graphs and Manifolds, ICLR Workshop 2019
Large-scale graph data in real-world applications is often not static but dynamic, i.e., new nodes and edges appear over time. Current graph convolution approaches are promising, especially when all of the graph's nodes and edges are available during training. When unseen nodes and edges are inserted after training, it has not yet been evaluated whether up-training the existing model or retraining from scratch is preferable. We construct an experimental setup in which we insert previously unseen nodes and edges after training and conduct a limited number of inference epochs. In this setup, we compare adapting pretrained graph neural networks against retraining from scratch. Our results show that pretrained models yield high accuracy scores on the unseen nodes and that pretraining is preferable over retraining from scratch. Our experiments represent a first step towards evaluating and developing truly online variants of graph neural networks.
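The reason a pretrained graph neural network can embed unseen nodes at all is that its weight matrices are shared across nodes, so a new node only needs features and edges to be propagated through the existing layers. The following is a minimal sketch of this idea with a single symmetric-normalised graph-convolution layer in plain NumPy; the graph sizes, features, and weights are toy values invented for illustration, not taken from the paper's experiments.

```python
import numpy as np

def gcn_layer(adj, features, weights):
    # One graph-convolution layer: add self-loops, symmetrically
    # normalise the adjacency matrix, propagate features, apply ReLU.
    a = adj + np.eye(adj.shape[0])
    d = 1.0 / np.sqrt(a.sum(axis=1))
    a_norm = a * d[:, None] * d[None, :]
    return np.maximum(a_norm @ features @ weights, 0.0)

rng = np.random.default_rng(0)

# "Pretraining" graph: 3 nodes with 5-dimensional features. The weight
# matrix stands in for parameters learned during pretraining.
weights = rng.standard_normal((5, 2))
adj_old = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
x_old = rng.standard_normal((3, 5))

# A previously unseen node arrives with an edge to node 0. Because the
# weights are shared across all nodes, the pretrained layer can embed
# the enlarged graph without any retraining.
adj_new = np.zeros((4, 4))
adj_new[:3, :3] = adj_old
adj_new[3, 0] = adj_new[0, 3] = 1.0
x_new = np.vstack([x_old, rng.standard_normal((1, 5))])

h = gcn_layer(adj_new, x_new, weights)  # embeddings for all 4 nodes
```

Up-training would continue optimising `weights` on the enlarged graph, whereas retraining from scratch would re-initialise them; the sketch only shows why inference on unseen nodes is possible in the first place.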
CBOW Is Not All You Need: Combining CBOW with the Compositional Matrix Space Model Mai, Florian, Galke, Lukas, and Scherp, Ansgar In International Conference on Learning Representations 2019
Continuous Bag of Words (CBOW) is a powerful text embedding method. Due to its strong capabilities to encode word content, CBOW embeddings perform well on a wide range of downstream tasks while being efficient to compute. However, CBOW is not capable of capturing word order. The reason is that the computation of CBOW's word embeddings is commutative, i.e., embeddings of XYZ and ZYX are the same. In order to address this shortcoming, we propose a learning algorithm for the Compositional Matrix Space Model, which we call Continual Multiplication of Words (CMOW). Our algorithm is an adaptation of word2vec, so that it can be trained on large quantities of unlabeled text. We empirically show that CMOW better captures linguistic properties, but it is inferior to CBOW in memorizing word content. Motivated by these findings, we propose a hybrid model that combines the strengths of CBOW and CMOW. Our results show that the hybrid CBOW-CMOW model retains CBOW's strong ability to memorize word content while at the same time substantially improving its ability to encode other linguistic information by 8%. As a result, the hybrid also performs better on 8 out of 11 supervised downstream tasks with an average improvement of 1.2%.
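The commutativity argument above can be made concrete in a few lines: summing word vectors (CBOW-style) is order-invariant, whereas multiplying word matrices (CMOW-style) is not. The sketch below uses toy random parameters and illustrative dimensions, not trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy word representations: a vector per word (CBOW-style) and a
# matrix per word (CMOW-style). Dimensions are illustrative only.
vocab = ["x", "y", "z"]
vectors = {w: rng.standard_normal(4) for w in vocab}
matrices = {w: rng.standard_normal((4, 4)) for w in vocab}

def cbow_embed(words):
    # Additive composition is commutative: word order is lost.
    return np.sum([vectors[w] for w in words], axis=0)

def cmow_embed(words):
    # Matrix-product composition is non-commutative: order matters.
    out = np.eye(4)
    for w in words:
        out = out @ matrices[w]
    return out.flatten()

# XYZ and ZYX coincide under CBOW but differ under CMOW.
assert np.allclose(cbow_embed(["x", "y", "z"]), cbow_embed(["z", "y", "x"]))
assert not np.allclose(cmow_embed(["x", "y", "z"]), cmow_embed(["z", "y", "x"]))
```

A hybrid model in the spirit of the paper would concatenate the two representations per sequence, pairing CBOW's content memorization with CMOW's order sensitivity.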
Inductive Learning of Concept Representations from Library-Scale Corpora with Graph Convolution Galke, Lukas, Melnychuk, Tetyana, Seidlmayer, Eva, Trog, Steffen, Foerstner, Konrad U., Schultz, Carsten, and Tochtermann, Klaus In INFORMATIK 2019
Automated research analyses are becoming more and more important as the volume of research items grows at an increasing pace. We pursue a new direction for dynamic research analyses with graph neural networks. So far, graph neural networks have only been applied to small-scale datasets and primarily supervised tasks such as node classification. We propose to use an unsupervised training objective for concept representation learning that is tailored towards bibliographic data with millions of research papers and thousands of concepts from a controlled vocabulary. We have evaluated the learned representations in clustering and classification downstream tasks. Furthermore, we have conducted nearest concept queries in the representation space. Our results show that the representations learned by graph convolution with our training objective are comparable to the ones learned by the DeepWalk algorithm. Our findings suggest that concept embeddings can be solely derived from the text of associated documents without using a lookup-table embedding. Thus, graph neural networks can operate on arbitrary document collections without re-training. This property makes graph neural networks useful for dynamic research analysis, which is often conducted on time-based snapshots of bibliographic data.
What If We Encoded Words as Matrices and Used Matrix Multiplication as Composition Function? Galke, Lukas, Mai, Florian, and Scherp, Ansgar In INFORMATIK 2019
We summarize our contribution to the International Conference on Learning Representations, CBOW Is Not All You Need: Combining CBOW with the Compositional Matrix Space Model, 2019. We construct a text encoder that learns matrix representations of words from unlabeled text, while using matrix multiplication as composition function. We show that our text encoder outperforms continuous bag-of-words representations on 9 out of 10 linguistic probing tasks and argue that the learned representations are complementary to those of vector-based approaches. Hence, we construct a hybrid model that jointly learns a matrix and a vector for each word. This hybrid model yields higher scores than purely vector-based approaches on 10 out of 16 downstream tasks in a controlled experiment with the same capacity and training data. Across all 16 tasks, the hybrid model achieves an average improvement of 1.2%. These results are promising insofar as they open up new opportunities to efficiently incorporate order awareness into word embedding models.
Performance Comparison of Ad-Hoc Retrieval Models over Full-Text vs. Titles of Documents Saleh, Ahmed, Beck, Tilman, Galke, Lukas, and Scherp, Ansgar In Maturity and Innovation in Digital Libraries 2018
While there are many studies on information retrieval models using full-text, there are presently no comparison studies of full-text retrieval vs. retrieval only over the titles of documents. On the one hand, the full-text of documents like scientific papers is not always available due to, e.g., copyright policies of academic publishers. On the other hand, conducting a search based on titles alone has strong limitations. Titles are short and therefore may not contain enough information to yield satisfactory search results. In this paper, we compare different retrieval models regarding their search performance on the full-text vs. only the titles of documents. We use different datasets, including the three digital library datasets EconBiz, IREON, and PubMed. The results show that it is possible to build effective title-based retrieval models that provide competitive results comparable to full-text retrieval. On average, the best title-based retrieval models perform only 3% worse than the best full-text-based retrieval models.
Using Adversarial Autoencoders for Multi-Modal Automatic Playlist Continuation Vagliano, Iacopo, Galke, Lukas, Mai, Florian, and Scherp, Ansgar In Proceedings of the ACM Recommender Systems Challenge 2018
The task of automatic playlist continuation is generating a list of recommended tracks that can be added to an existing playlist. By suggesting appropriate tracks, i.e., songs to add to a playlist, a recommender system can increase user engagement by making playlist creation easier, as well as extending listening beyond the end of the current playlist. The ACM Recommender Systems Challenge 2018 focuses on this task. Spotify released a dataset of playlists, which includes a large number of playlists and associated track listings. Given a set of playlists from which a number of tracks have been withheld, the goal is to predict the missing tracks in those playlists. We participated in the challenge as the team Unconscious Bias and, in this paper, we present our approach. We extend adversarial autoencoders to the problem of automatic playlist continuation. We show how multiple input modalities, such as the playlist titles as well as track titles, artists, and albums, can be incorporated in the playlist continuation task.
A Case Study of Closed-Domain Response Suggestion with Limited Training Data Galke, Lukas, Gerstenkorn, Gunnar, and Scherp, Ansgar In Database and Expert Systems Applications 2018
We analyze the problem of response suggestion in a closed domain in a real-world scenario of a digital library. We present a text-processing pipeline to generate question-answer pairs from chat transcripts. On this limited amount of training data, we compare retrieval-based, conditioned-generation, and dedicated representation learning approaches for response suggestion. Our results show that retrieval-based methods that strive to find similar, known contexts are preferable over parametric approaches from the conditioned-generation family when the training data is limited. We, however, identify a specific representation learning approach that is competitive with the retrieval-based approaches despite the training data limitation.
Multi-Modal Adversarial Autoencoders for Recommendations of Citations and Subject Labels Galke, Lukas, Mai, Florian, Vagliano, Iacopo, and Scherp, Ansgar In Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization 2018
We present multi-modal adversarial autoencoders for recommendation and evaluate them on two different tasks: citation recommendation and subject label recommendation. We analyze the effects of adversarial regularization, sparsity, and different input modalities. By conducting 408 experiments, we show that adversarial regularization consistently improves the performance of autoencoders for recommendation. We demonstrate, however, that the two tasks differ in the semantics of item co-occurrence in the sense that item co-occurrence resembles relatedness in the case of citations, yet implies diversity in the case of subject labels. Our results reveal that supplying the partial item set as input is only helpful when item co-occurrence resembles relatedness. When facing a new recommendation task, it is therefore crucial to consider the semantics of item co-occurrence for the choice of an appropriate model.
Linked Open Citation Database: Enabling Libraries to Contribute to an Open and Interconnected Citation Graph Lauscher, Anne, Eckert, Kai, Galke, Lukas, Scherp, Ansgar, Rizvi, Syed Tahseen Raza, Ahmed, Sheraz, Dengel, Andreas, Zumstein, Philipp, and Klein, Annette In Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries 2018
Citations play a crucial role in the scientific discourse, in information retrieval, and in bibliometrics. Many initiatives are currently promoting the idea of having free and open citation data. Creation of citation data, however, is not part of the cataloging workflow in libraries nowadays. In this paper, we present our project Linked Open Citation Database, in which we design distributed processes and a system infrastructure based on linked data technology. The goal is to show that efficiently cataloging citations in libraries using a semi-automatic approach is possible. We specifically describe the current state of the workflow and its implementation. We show that we could significantly improve the automatic reference extraction that is crucial for the subsequent data curation. We further give insights on the curation and linking process and provide evaluation results that not only direct the further development of the project, but also allow us to discuss its overall feasibility.
Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Competitive Performance to Full-Text Mai, Florian, Galke, Lukas, and Scherp, Ansgar In Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries 2018
For (semi-)automated subject indexing systems in digital libraries, it is often more practical to use metadata such as the title of a publication instead of the full-text or the abstract. Therefore, it is desirable to have good text mining and text classification algorithms that operate well already on the title of a publication. So far, the classification performance on titles is not competitive with the performance on the full-texts if the same number of training samples is used for training. However, it is much easier to obtain title data in large quantities and to use it for training than full-text data. In this paper, we investigate how models obtained from training on increasing amounts of title training data compare to models from training on a constant number of full-texts. We evaluate this question on a large-scale dataset from the medical domain (PubMed) and from economics (EconBiz). In these datasets, the titles and annotations of millions of publications are available, and they outnumber the available full-texts by a factor of 20 and 15, respectively. To exploit these large amounts of data to their full potential, we develop three strong deep learning classifiers and evaluate their performance on the two datasets. The results are promising. On the EconBiz dataset, all three classifiers outperform their full-text counterparts by a large margin. The best title-based classifier outperforms the best full-text method by 9.9%. On the PubMed dataset, the best title-based method almost reaches the performance of the best full-text classifier, with a difference of only 2.9%.
Using Titles vs. Full-text As Source for Automated Semantic Document Annotation Galke, Lukas, Mai, Florian, Schelten, Alan, Brunsch, Dennis, and Scherp, Ansgar In Proceedings of the Knowledge Capture Conference 2017
We conduct the first systematic comparison of automated semantic annotation based on either the full-text or only on the title metadata of documents. Apart from the prominent text classification baselines kNN and SVM, we also compare recent techniques of Learning to Rank and neural networks and revisit the traditional methods logistic regression, Rocchio, and Naive Bayes. Across three of our four datasets, the performance of the classifications using only titles reaches over 90% of the quality compared to the performance when using the full-text.
Word Embeddings for Practical Information Retrieval Galke, Lukas, Saleh, Ahmed, and Scherp, Ansgar In INFORMATIK 2017
We assess the suitability of word embeddings for practical information retrieval scenarios. Thus, we assume that users issue ad-hoc short queries where we return the first twenty retrieved documents after applying a boolean matching operation between the query and the documents. We compare the performance of several techniques that leverage word embeddings in the retrieval models to compute the similarity between the query and the documents, namely word centroid similarity, paragraph vectors, Word Mover's distance, as well as our novel inverse document frequency (IDF) re-weighted word centroid similarity. We evaluate the performance using the ranking metrics mean average precision, mean reciprocal rank, and normalized discounted cumulative gain. Additionally, we inspect the retrieval models' sensitivity to document length by using either only the title or the full-text of the documents for the retrieval task. We conclude that word centroid similarity is the best competitor to state-of-the-art retrieval models. It can be further improved by re-weighting the word frequencies with IDF before aggregating the respective word vectors of the embedding. The proposed cosine similarity of IDF re-weighted word vectors is competitive with the TF-IDF baseline and even outperforms it in the news domain by a relative margin of 15%.
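The IDF re-weighted word centroid similarity described above can be sketched compactly: each document (or query) is represented by the sum of its word vectors, each weighted by term frequency times IDF, and query-document similarity is the cosine of the two normalised centroids. The embeddings and IDF values below are toy numbers invented for illustration, not trained or corpus-derived ones.

```python
import numpy as np
from collections import Counter

# Toy word embeddings and IDF weights (illustrative values only).
embeddings = {
    "economic": np.array([1.0, 0.2, 0.0]),
    "growth":   np.array([0.8, 0.4, 0.1]),
    "the":      np.array([0.1, 0.1, 0.1]),
}
idf = {"economic": 2.3, "growth": 1.9, "the": 0.05}

def idf_centroid(tokens):
    # Sum the word vectors weighted by term frequency times IDF,
    # then length-normalise the resulting centroid.
    counts = Counter(t for t in tokens if t in embeddings)
    centroid = sum(c * idf[t] * embeddings[t] for t, c in counts.items())
    return centroid / np.linalg.norm(centroid)

def word_centroid_similarity(query, doc):
    # Cosine similarity of the two normalised centroids.
    return float(idf_centroid(query) @ idf_centroid(doc))

score = word_centroid_similarity(["economic", "growth"],
                                 ["the", "economic", "growth"])
```

The IDF weighting down-weights frequent, low-information words such as "the", so the centroid is dominated by content-bearing terms; in a practical pipeline this scoring would be applied only to the candidate set surviving the boolean match.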
Embedded Retrieval – Word Embeddings for Practical Information Retrieval Galke, Lukas 2017
We assess the suitability of word embeddings for practical information retrieval. While limiting ourselves to unsupervised models, we compare the performance of several techniques that leverage word embeddings in retrieval models (i.e., provide a query-document similarity): the intuitive word centroid similarity, dedicated paragraph vectors, the physically inspired Word Mover's distance, as well as a novel IDF re-weighted word centroid similarity. In our comparison, we strive to simulate a strictly practical setting: short queries, a boolean matching operation, and only the first twenty retrieved documents are considered. We evaluate the performance using the ranking metrics mean average precision, mean reciprocal rank, and normalised discounted cumulative gain. Additionally, we inspect the retrieval models' sensitivity to document length by using either only the title or the full-text as documents. We conclude that word centroid similarity is the best competitor to state-of-the-art retrieval models and can be further improved by re-weighting the word frequencies according to inverse document frequency before aggregating the respective word vectors of the embedding. The proposed cosine similarity of IDF re-weighted word vectors is competitive with the TF-IDF baseline and even outperforms it in the news domain by a relative margin of 15%. In the context of this research contribution, a dedicated information retrieval framework has been developed. The key features include the incorporation of embedding-based retrieval models, the simulation of a practical setting, automatic evaluation, as well as convenient extensibility with new retrieval models. The corresponding user's guide and developer's guide are part of this work.