diff --git a/.gitignore b/.gitignore
index 6ca589780f36425419adca13b0e65b84c3381aa7..8462878ca066476f6f0a62be783d088de6a3aa0c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -84,10 +84,12 @@
 data/experiments/
 data/figures/
-data/term_extraction/experiments/
+data/plots/
+
+data/term_extraction/BioTag_labels_via_tokenizer_offset/datasets_*/
 data/term_extraction/evaluation_files/
+data/term_extraction/experiments/
 data/term_extraction/predictions_files/
-data/term_extraction/BioTag_labels_via_tokenizer_offset/datasets_*/
 data/term_extraction/model_files/
 data/term_extraction/topological_features_vectorized/
diff --git a/README.md b/README.md
index 5c4e2beb2459cc6914d6a34a82bcadc43080c466..624be08e0da448dc3fd5c5900bab6152110b2b54 100644
--- a/README.md
+++ b/README.md
@@ -2,11 +2,11 @@
 ## Overview
-This public repository contains the code to our paper “Local Topology Measures of Contextual Language Model Latent Spaces With Applications to Dialogue Term Extraction”, published at the [25th Meeting of the Special Interest Group on Discourse and Dialogue, Kyoto, Japan (SIGDIAL 2024)](https://2024.sigdial.org/).
+This public repository contains the code to our paper [“Local Topology Measures of Contextual Language Model Latent Spaces With Applications to Dialogue Term Extraction”](https://doi.org/10.18653/v1/2024.sigdial-1.31), published at the [25th Meeting of the Special Interest Group on Discourse and Dialogue, Kyoto, Japan (SIGDIAL 2024)](https://2024.sigdial.org/).
 In this project, we demonstrate that contextual topological features derived from neighborhoods in a language model embedding space can be used to increase performance in a term extraction task on dialogue data over a baseline which only uses the language model embedding vectors as input.
-We also compare with our earlier work, ["Dialogue Term Extraction using Transfer Learning and Topological Data Analysis"](https://aclanthology.org/2022.sigdial-1.53/), which is based on topological features derived from static word embeddings.
+We also compare with our earlier work, ["Dialogue Term Extraction using Transfer Learning and Topological Data Analysis"](https://doi.org/10.18653/v1/2022.sigdial-1.53), which is based on topological features derived from static word embeddings.
 ## Installation
@@ -102,6 +102,8 @@ To create the data for the term extraction task, follow the instructions given i
 The following is an example call to run the full term extraction training-prediction-evaluation pipeline for a single setup with default parameters. Note that for this to work, you need to have the correctly prepared data in the `data/` directory.
+For testing the pipeline in the `--toy_dataset_mode`, we provide preprocessed [debug data](https://doi.org/10.5281/zenodo.14035394) which needs to be placed into the repository's `data/` directory.
+
 Run one of the following commands to start the pipeline on the toy data, with different feature types:
diff --git a/instructions_for_publishing_to_public_repository.md b/instructions_for_publishing_to_public_repository.md
new file mode 100644
index 0000000000000000000000000000000000000000..22d131529efb0fc75bb7d8b64912b69391dd8f49
--- /dev/null
+++ b/instructions_for_publishing_to_public_repository.md
@@ -0,0 +1,29 @@
+# Instructions for publishing to public repository
+
+Add the public repository to the remote list:
+
+```bash
+git remote add public https://oauth2:$PROJECT_ACCESS_TOKEN@gitlab.cs.uni-duesseldorf.de/general/dsml/tda4contextualembeddings-public.git
+git fetch public
+```
+
+Create an orphan branch from the current repository state, so that we can push the content without the history to the public repository:
+
+```bash
+git checkout --orphan temp_branch_for_public
+git add .
+git commit -m "Updated public release of tda_for_contextual_spaces project"
+```
+
+Push the content of the new history-less branch to the public repository into a separate branch:
+
+```bash
+git push public temp_branch_for_public:updated_public_code_release
+```
+
+Delete the temporary branch:
+
+```bash
+git checkout main
+git branch -D temp_branch_for_public
+```
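The publishing steps in the new instructions file could also be collected into a single script. The following is a minimal sketch, not part of the repository: the `run` helper, the `publish_snapshot` function, and the environment variables (`PUBLIC_REMOTE`, `TEMP_BRANCH`, `TARGET_BRANCH`, `RETURN_BRANCH`, `DRY_RUN`) are hypothetical names chosen for illustration, and it assumes the `public` remote has already been added. It defaults to a dry run that only prints the commands, and the push uses the same branch name that `git checkout --orphan` creates.

```bash
#!/usr/bin/env bash
# Sketch only: publish a history-less snapshot of the current working tree.
# All variable and function names below are illustrative assumptions.
set -eu

PUBLIC_REMOTE="${PUBLIC_REMOTE:-public}"                 # remote added beforehand
TEMP_BRANCH="${TEMP_BRANCH:-temp_branch_for_public}"     # throwaway orphan branch
TARGET_BRANCH="${TARGET_BRANCH:-updated_public_code_release}"
RETURN_BRANCH="${RETURN_BRANCH:-main}"                   # branch to return to afterwards
DRY_RUN="${DRY_RUN:-1}"                                  # 1 = print commands, 0 = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

publish_snapshot() {
  run git fetch "$PUBLIC_REMOTE"
  # An orphan branch starts without any parent commits, so no history is pushed.
  run git checkout --orphan "$TEMP_BRANCH"
  run git add .
  run git commit -m "Updated public release of tda_for_contextual_spaces project"
  # Push the orphan branch under a separate name on the public remote.
  run git push "$PUBLIC_REMOTE" "$TEMP_BRANCH:$TARGET_BRANCH"
  # Clean up: leave the orphan branch and delete it locally.
  run git checkout "$RETURN_BRANCH"
  run git branch -D "$TEMP_BRANCH"
}

publish_snapshot
```

With the default `DRY_RUN=1` the script only lists the seven git commands in order; setting `DRY_RUN=0` executes them. Since `git add .` stages everything in the working tree that is not ignored, it is worth checking `git status` before a real run.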