tdacontextual
Overview
This public repository contains the code for our paper “Local Topology Measures of Contextual Language Model Latent Spaces with Applications to Dialogue Term Extraction”, published at the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Kyoto, Japan (SIGDIAL 2024).
In this project, we demonstrate that contextual topological features derived from neighborhoods in a language model embedding space can improve performance on a dialogue term extraction task over a baseline that uses only the language model embedding vectors as input.
We also compare with our earlier work, "Dialogue Term Extraction using Transfer Learning and Topological Data Analysis", which is based on topological features derived from static word embeddings.
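As a rough illustration of the idea, here is a minimal sketch: it retrieves the nearest neighbors of a query embedding from a datastore with faiss (which this repository uses for neighborhood computation) and vectorizes the H0 persistence diagram of that neighborhood as a persistence image. The ripser and persim packages are stand-ins chosen for the sketch, and the random data and all parameter values are illustrative; this is not the repository's actual feature pipeline.
import numpy as np
import faiss  # nearest-neighbor search over the datastore
from ripser import ripser  # persistent homology of a point cloud
from persim import PersistenceImager  # persistence-image vectorization

rng = np.random.default_rng(0)
d = 768  # embedding dimension (e.g. a BERT-sized model)

# Datastore: contextual embedding vectors collected from a corpus.
datastore = rng.standard_normal((10_000, d)).astype("float32")
index = faiss.IndexFlatL2(d)
index.add(datastore)

# Query: the embedding of a single token occurrence.
query = rng.standard_normal((1, d)).astype("float32")
_, neighbor_ids = index.search(query, 50)
neighborhood = datastore[neighbor_ids[0]]

# Persistent homology of the local point cloud; H0 tracks how the
# neighborhood merges into connected components across scales.
dgms = ripser(neighborhood, maxdim=0)["dgms"]
h0 = dgms[0][np.isfinite(dgms[0]).all(axis=1)]  # drop the infinite bar

# Vectorize the H0 diagram as a persistence image; ranges are set
# explicitly here only to keep the toy example deterministic.
pimgr = PersistenceImager(pixel_size=0.5, birth_range=(0.0, 0.5),
                          pers_range=(0.0, 50.0))
features = pimgr.transform(h0).flatten()
print(features.shape)
The resulting vector can then be concatenated with the raw embedding as classifier input, in the spirit of the lm_c_pis_h0 feature type used in the pipeline commands below.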
Installation
Prerequisites
- Prepare the environment. The required Python version is 3.10.
  On macOS, you can install pyenv with Homebrew: brew install pyenv.
  You can install poetry with pipx, for example: pipx install poetry.
- Install the Python version with pyenv and set the local Python version for the project:
pyenv install 3.10
pyenv local 3.10
Installation instructions with poetry
- Tell poetry to use the local Python version.
# Optional:
# Tell poetry to create a virtual environment for the project
# inside the project directory.
poetry config virtualenvs.in-project true
poetry env use 3.10
You can manage the poetry environments with the following commands:
poetry env list --full-path # List all the environments
poetry env remove <path> # Remove an environment
poetry env remove --all # Remove all the environments
- Install the project with dependencies. Select the appropriate dependency groups for your system.
poetry install --with gpu,dev --without cpu # For GPU
poetry install --with cpu,dev --without gpu # For CPU
NOTE:
You might need to install faiss manually using conda, since pip installs can lead to problems.
Please see the official documentation for more information: Faiss Documentation.
In that case, you can try installing into a fresh conda environment with the following commands, which create the environment, install faiss via conda, and then install tdacontextual (this module) with poetry.
The resulting environment can be used for the neighborhood computation scripts.
conda create -n "python3.10_for_tdacontextual" python=3.10
conda activate "python3.10_for_tdacontextual"
conda install -c pytorch faiss-cpu
cd $TDACONTEXTUAL_REPOSITORY_BASE_PATH # Change to the repository root directory
poetry install --with dev --without cpu,gpu
NOTE:
Some of the data preparation scripts require the convlab package.
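If convlab is missing from your environment, one way to add it (a suggestion on our part, not a pinned requirement of this repository; check pyproject.toml for the exact dependency specification) is to install it into the active poetry environment via pip:
poetry run pip install convlab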
If you run the Python scripts from the command line, make sure to activate the poetry environment:
poetry shell
Project-specific setup
- Set the correct environment variables used in the project config.
  Edit the script tdacontextual/scripts/setup_environment.sh with the correct paths and run it once.
./tdacontextual/scripts/setup_environment.sh
- If required, e.g. when running jobs on the HHU Hilbert cluster, set the correct environment variables in the .env file in the project root directory (see the sketch below).
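As a purely illustrative sketch, a minimal .env might look like the following; TDACONTEXTUAL_REPOSITORY_BASE_PATH appears in the commands above, while the value and any additional variables depend on your setup:
# .env -- illustrative values, adapt to your setup
TDACONTEXTUAL_REPOSITORY_BASE_PATH=/path/to/tdacontextual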
Usage
The main entry points for the project are specified as poetry run commands; see the pyproject.toml file for more information.
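For orientation, poetry declares such entry points in a [tool.poetry.scripts] table. The following is a sketch in which the module path is hypothetical; consult pyproject.toml for the actual mapping:
[tool.poetry.scripts]
# Hypothetical module path; see pyproject.toml for the real entry point.
full_training_prediction_evaluation_pipeline_for_single_setup = "tdacontextual.pipeline:main"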
To create the data for the term extraction task, follow the instructions given in the tdacontextual/term_extraction directory.
The following is an example call to run the full term extraction training-prediction-evaluation pipeline for a single setup with default parameters.
Note that for this to work, you need to have the correctly prepared data in the data/ directory.
For testing the pipeline in --toy_dataset_mode, we provide preprocessed debug data, which needs to be placed into the repository's data/ directory.
Run one of the following commands to start the pipeline on the toy data, with different feature types:
# Run with input just the language model embeddings
poetry run full_training_prediction_evaluation_pipeline_for_single_setup --feature_type="lm" --toy_dataset_mode
# Run with input the language model embeddings and the contextual topological features
poetry run full_training_prediction_evaluation_pipeline_for_single_setup --feature_type="lm_c_pis_h0" --toy_dataset_mode
Run the following command to start the pipeline on the full data:
poetry run full_training_prediction_evaluation_pipeline_for_single_setup --feature_type="lm"
The resulting artifacts (training and evaluation metrics, predictions, logs, etc.) will be saved in the data/experiments directory.
Running tests
Run the following command from the repository root:
poetry run python3 -m pytest -m "not slow" tests/ --cov=tdacontextual/ --cov-report=html:tests/temp_files/coverage_report
We also provide a script to run the tests; execute the following command from the repository root:
./tdacontextual/scripts/run_tests_with_coverage.sh
Citation
doi:10.18653/v1/2024.sigdial-1.31
@inproceedings{ruppik-etal-2024-local,
title = "Local Topology Measures of Contextual Language Model Latent Spaces with Applications to Dialogue Term Extraction",
author = "Ruppik, Benjamin Matthias and
Heck, Michael and
van Niekerk, Carel and
Vukovic, Renato and
Lin, Hsien-chin and
Feng, Shutong and
Zibrowius, Marcus and
Gasic, Milica",
editor = "Kawahara, Tatsuya and
Demberg, Vera and
Ultes, Stefan and
Inoue, Koji and
Mehri, Shikib and
Howcroft, David and
Komatani, Kazunori",
booktitle = "Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue",
month = sep,
year = "2024",
address = "Kyoto, Japan",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.sigdial-1.31",
doi = "10.18653/v1/2024.sigdial-1.31",
pages = "344--356",
abstract = "A common approach for sequence tagging tasks based on contextual word representations is to train a machine learning classifier directly on these embedding vectors. This approach has two shortcomings. First, such methods consider single input sequences in isolation and are unable to put an individual embedding vector in relation to vectors outside the current local context of use. Second, the high performance of these models relies on fine-tuning the embedding model in conjunction with the classifier, which may not always be feasible due to the size or inaccessibility of the underlying feature-generation model. It is thus desirable, given a collection of embedding vectors of a corpus, i.e. a datastore, to find features of each vector that describe its relation to other, similar vectors in the datastore. With this in mind, we introduce complexity measures of the local topology of the latent space of a contextual language model with respect to a given datastore. The effectiveness of our features is demonstrated through their application to dialogue term extraction. Our work continues a line of research that explores the manifold hypothesis for word embeddings, demonstrating that local structure in the space carved out by word embeddings can be exploited to infer semantic properties.",
}