This public repository contains the code accompanying our paper “Local Topology Measures of Contextual Language Model Latent Spaces With Applications to Dialogue Term Extraction”, published at the [25th Meeting of the Special Interest Group on Discourse and Dialogue, Kyoto, Japan (SIGDIAL 2024)](https://2024.sigdial.org/).
In this project, we demonstrate that contextual topological features derived from neighborhoods in a language model embedding space can improve performance on a dialogue term extraction task over a baseline that uses only the language model embedding vectors as input.

We also compare with our earlier work, ["Dialogue Term Extraction using Transfer Learning and Topological Data Analysis"](https://aclanthology.org/2022.sigdial-1.53/), which is based on topological features derived from static word embeddings.
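As a rough illustration of this kind of feature construction, the sketch below augments a token embedding with simple topological summaries of its nearest-neighbor neighborhood in a datastore of embedding vectors. This is a hypothetical, simplified example rather than the pipeline implemented in this repository: the helper `local_topology_features`, the use of `ripser` for persistent homology, the neighborhood size `k`, and the random placeholder embeddings are all assumptions made for demonstration purposes.

```python
# Hypothetical sketch: augment a token embedding with simple summaries of the
# local topology of its k-nearest-neighbor neighborhood in a datastore.
# Not the actual pipeline of this repository.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from ripser import ripser  # persistent homology of point clouds


def local_topology_features(query: np.ndarray, datastore: np.ndarray, k: int = 32) -> np.ndarray:
    """Summarize the local topology of the k-NN neighborhood of `query`."""
    # Refitting per query is inefficient; a real pipeline would fit once.
    nn = NearestNeighbors(n_neighbors=k).fit(datastore)
    _, idx = nn.kneighbors(query.reshape(1, -1))
    neighborhood = datastore[idx[0]]
    # Persistence diagrams of the neighborhood point cloud (dimensions 0 and 1).
    diagrams = ripser(neighborhood, maxdim=1)["dgms"]
    # One scalar summary per homology dimension: total finite persistence.
    summaries = []
    for dgm in diagrams:
        finite = dgm[np.isfinite(dgm[:, 1])]
        summaries.append(float((finite[:, 1] - finite[:, 0]).sum()) if len(finite) else 0.0)
    return np.asarray(summaries)


# Toy usage: concatenate the topological summaries with the raw embedding
# vector before feeding a token-level classifier.
rng = np.random.default_rng(0)
datastore = rng.normal(size=(1000, 768))  # placeholder for contextual embeddings
token_embedding = rng.normal(size=768)
augmented = np.concatenate([token_embedding, local_topology_features(token_embedding, datastore)])
```

In the paper, such neighborhood-based features are computed with respect to a fixed datastore of contextual embeddings and passed to a classifier alongside the embedding vectors; the concrete complexity measures used there differ from the toy summaries above.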
## Citation
The paper is available in the [ACL Anthology](https://aclanthology.org/2024.sigdial-1.31/). If you use this repository, please cite:
title="Local Topology Measures of Contextual Language Model Latent Spaces with Applications to Dialogue Term Extraction",
author="Ruppik, Benjamin Matthias and
Heck, Michael and
van Niekerk, Carel and
Vukovic, Renato and
Lin, Hsien-chin and
Feng, Shutong and
Zibrowius, Marcus and
Gasic, Milica",
editor="Kawahara, Tatsuya and
Demberg, Vera and
Ultes, Stefan and
Inoue, Koji and
Mehri, Shikib and
Howcroft, David and
Komatani, Kazunori",
booktitle="Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue",
month=sep,
year="2024",
address="Kyoto, Japan",
publisher="Association for Computational Linguistics",
url="https://aclanthology.org/2024.sigdial-1.31",
doi="10.18653/v1/2024.sigdial-1.31",
pages="344--356",
abstract="A common approach for sequence tagging tasks based on contextual word representations is to train a machine learning classifier directly on these embedding vectors. This approach has two shortcomings. First, such methods consider single input sequences in isolation and are unable to put an individual embedding vector in relation to vectors outside the current local context of use. Second, the high performance of these models relies on fine-tuning the embedding model in conjunction with the classifier, which may not always be feasible due to the size or inaccessibility of the underlying feature-generation model. It is thus desirable, given a collection of embedding vectors of a corpus, i.e. a datastore, to find features of each vector that describe its relation to other, similar vectors in the datastore. With this in mind, we introduce complexity measures of the local topology of the latent space of a contextual language model with respect to a given datastore. The effectiveness of our features is demonstrated through their application to dialogue term extraction. Our work continues a line of research that explores the manifold hypothesis for word embeddings, demonstrating that local structure in the space carved out by word embeddings can be exploited to infer semantic properties.",