Merge branch 'updated_public_code_release' into 'main'

Updated release with link to debug data. See merge request !2

6bd3becc · Benjamin Ruppik · 9e614da2 · 79f4f010
Commit 6bd3becc authored 9 months ago by Benjamin Ruppik
--- a/.gitignore
+++ b/.gitignore
@@ -84,10 +84,12 @@ data/experiments/

 data/figures/

-data/term_extraction/experiments/
+data/plots/
+
+data/term_extraction/BioTag_labels_via_tokenizer_offset/datasets_*/
 data/term_extraction/evaluation_files/
+data/term_extraction/experiments/
 data/term_extraction/predictions_files/
-data/term_extraction/BioTag_labels_via_tokenizer_offset/datasets_*/
 data/term_extraction/model_files/
 data/term_extraction/topological_features_vectorized/


--- a/README.md
+++ b/README.md
@@ -2,11 +2,11 @@

 ## Overview

-This public repository contains the code to our paper “Local Topology Measures of Contextual Language Model Latent Spaces With Applications to Dialogue Term Extraction”, published at the [25th Meeting of the Special Interest Group on Discourse and Dialogue, Kyoto, Japan (SIGDIAL 2024)](https://2024.sigdial.org/).
+This public repository contains the code to our paper [“Local Topology Measures of Contextual Language Model Latent Spaces With Applications to Dialogue Term Extraction”](https://doi.org/10.18653/v1/2024.sigdial-1.31), published at the [25th Meeting of the Special Interest Group on Discourse and Dialogue, Kyoto, Japan (SIGDIAL 2024)](https://2024.sigdial.org/).

 In this project, we demonstrate that contextual topological features derived from neighborhoods in a language model embedding space can be used to increase performance in a term extraction task on dialogue data over a baseline which only uses the language model embedding vectors as input.

-We also compare with our earlier work, ["Dialogue Term Extraction using Transfer Learning and Topological Data Analysis"](https://aclanthology.org/2022.sigdial-1.53/), which is based on topological features derived from static word embeddings.
+We also compare with our earlier work, ["Dialogue Term Extraction using Transfer Learning and Topological Data Analysis"](https://doi.org/10.18653/v1/2022.sigdial-1.53), which is based on topological features derived from static word embeddings.

 ## Installation

@@ -102,6 +102,8 @@ To create the data for the term extraction task, follow the instructions given i
 The following is an example call to run the full term extraction training-prediction-evaluation pipeline for a single setup with default parameters.
 Note that for this to work, you need to have the correctly prepared data in the `data/` directory.

+For testing the pipeline in `--toy_dataset_mode`, we provide preprocessed [debug data](https://doi.org/10.5281/zenodo.14035394), which needs to be placed in the repository's `data/` directory.
+
 Run one of the following commands to start the pipeline on the toy data,
 with different feature types:


--- a/instructions_for_publishing_to_public_repository.md
+++ b/instructions_for_publishing_to_public_repository.md
+# Instructions for publishing to public repository
+
+Add the public repository to the remote list:
+
+```bash
+git remote add public https://oauth2:$PROJECT_ACCESS_TOKEN@gitlab.cs.uni-duesseldorf.de/general/dsml/tda4contextualembeddings-public.git
+git fetch public
+```
+
+Create an orphan branch from the current repository state, so that we can push the content without the history to the public repository:
+
+```bash
+git checkout --orphan temp_branch_for_public
+git add .
+git commit -m "Updated public release of tda_for_contextual_spaces project"
+```
+
+Push the content of the new history-less branch to the public repository into a separate branch:
+
+```bash
+git push public temp_branch_for_public:updated_public_code_release
+```
+
+Delete the temporary branch:
+
+```bash
+git checkout main
+git branch -D temp_branch_for_public
+```
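
Note on the publishing steps in `instructions_for_publishing_to_public_repository.md`: the fetch, orphan-branch, push, and cleanup commands could be collected into one script. The following is a minimal sketch and not part of this commit. It assumes the `public` remote has already been added as described in the instructions. The variable names `PUBLIC_REMOTE`, `TEMP_BRANCH`, `TARGET_BRANCH`, and `DRY_RUN` and the helper functions `run` and `publish_snapshot` are hypothetical and chosen only for illustration. By default the script only prints the git commands it would run.

```bash
#!/usr/bin/env bash
# Sketch: publish the current repository state to the public remote
# without its history. All names below are illustrative assumptions.
set -euo pipefail

PUBLIC_REMOTE="${PUBLIC_REMOTE:-public}"
TEMP_BRANCH="${TEMP_BRANCH:-temp_branch_for_public}"
TARGET_BRANCH="${TARGET_BRANCH:-updated_public_code_release}"
DRY_RUN="${DRY_RUN:-1}"  # 1: only print the commands; 0: execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

publish_snapshot() {
    run git fetch "$PUBLIC_REMOTE"
    # An orphan branch starts without any parent commits.
    run git checkout --orphan "$TEMP_BRANCH"
    run git add .
    run git commit -m "Updated public release of tda_for_contextual_spaces project"
    # Push the history-less branch into a separate branch on the public remote.
    run git push "$PUBLIC_REMOTE" "$TEMP_BRANCH:$TARGET_BRANCH"
    # Return to main and delete the temporary branch.
    run git checkout main
    run git branch -D "$TEMP_BRANCH"
}

publish_snapshot
```

Running the script as is lists the seven git commands for review. Setting `DRY_RUN=0` in the environment executes them instead, so it is worth checking the printed branch names first.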