The model takes **images of single words or text lines (multiple words) as input**.
## Run demo
* Download one of the pretrained models
  * [Model trained on word images](https://www.dropbox.com/s/mya8hw6jyzqm0a3/word-model.zip?dl=1):
    only handles single words per image, but gives better results on the IAM word dataset
  * [Model trained on text line images](https://www.dropbox.com/s/7xwkcilho10rthn/line-model.zip?dl=1):
    can handle multiple words in one image
* Put the contents of the downloaded zip-file into the `model` directory of the repository
* Go to the `src` directory
...
...
The input images and the expected outputs are shown below when the text line model is used.

```
> python main.py
Init with stored values from ../model/snapshot-13
Recognized: "word"
Probability: 0.9806370139122009
```

```
> python main.py --img_file ../data/line.png
Init with stored values from ../model/snapshot-13
Recognized: "or work on line level"
Probability: 0.6674373149871826
```
## Command line arguments
* `--mode`: select between "train", "validate" and "infer". Defaults to "infer".
* `--decoder`: select from CTC decoders "bestpath", "beamsearch" and "wordbeamsearch". Defaults to "bestpath". For option "wordbeamsearch" see details below.
* `--batch_size`: batch size.
* `--data_dir`: directory containing IAM dataset (with subdirectories `img` and `gt`).
* `--fast`: use LMDB to load images (faster than loading image files from disk).
* `--line_mode`: train reading text lines instead of single words.
* `--img_file`: image that is used for inference.
* `--dump`: dumps the output of the NN to CSV file(s) saved in the `dump` folder. Can be used as input for the [CTCDecoder](https://github.com/githubharald/CTCDecoder).
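As a usage sketch, the options above can be combined; the paths below are placeholders, not files guaranteed to exist in your checkout:

```shell
# validate a trained model on the IAM dataset, decoding with word beam search
python main.py --mode validate --data_dir path/to/iam --decoder wordbeamsearch

# run inference on one image and dump the NN output for the CTCDecoder
python main.py --img_file ../data/word.png --dump
```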
...
...
Further, the manually created list of word-characters can be found in the file ...
Beam width is set to 50 to conform with the beam width of vanilla beam search decoding.
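For intuition, best path decoding (the `--decoder bestpath` default) is simple enough to sketch in a few lines. This is a minimal NumPy illustration of the idea, not the code used by this repository:

```python
import numpy as np

def best_path_decode(mat, chars, blank_idx):
    """Greedy CTC decoding: pick the most likely character per time-step,
    collapse repeated characters, then drop the blank label."""
    best = np.argmax(mat, axis=1)  # shape: (time_steps,)
    collapsed = [k for i, k in enumerate(best) if i == 0 or k != best[i - 1]]
    return "".join(chars[k] for k in collapsed if k != blank_idx)

# toy score matrix: 4 time-steps, 3 classes ("a", "b", blank)
mat = np.array([
    [0.8, 0.1, 0.1],  # "a"
    [0.8, 0.1, 0.1],  # "a" again (repeat, gets collapsed)
    [0.1, 0.1, 0.8],  # blank
    [0.1, 0.8, 0.1],  # "b"
])
print(best_path_decode(mat, "ab", blank_idx=2))  # -> ab
```

Beam search differs in that it keeps the `beam_width` most probable label sequences at each time-step instead of a single greedy choice.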
## Train model on IAM dataset
### Prepare dataset
Follow these instructions to get the IAM dataset:
* Register for free at this [website](http://www.fki.inf.unibe.ch/databases/iam-handwriting-database)
...
...
* Put `words.txt` into the `gt` directory
* Put the content (directories `a01`, `a02`, ...) of `words.tgz` into the `img` directory
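As an illustration, the ground-truth file can be read with a few lines of Python. This is a hypothetical sketch, not code from this repository; it assumes the layout documented inside `words.txt` itself (`#` comment lines, whitespace-separated fields with the word id first, the segmentation flag second and the transcription last):

```python
def parse_words_txt(lines):
    """Yield (word_id, transcription) pairs from IAM words.txt lines,
    skipping comments, empty lines and words with segmentation errors."""
    samples = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.split()
        word_id, seg_ok, text = parts[0], parts[1], parts[-1]
        if seg_ok != "ok":
            continue  # the flag "err" marks badly segmented words
        samples.append((word_id, text))
    return samples

# made-up sample line mimicking the IAM format
sample = [
    "# word_id seg_flag graylevel x y w h tag transcription",
    "a01-000u-00-00 ok 154 408 768 27 51 AT A",
]
print(parse_words_txt(sample))  # -> [('a01-000u-00-00', 'A')]
```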
### Run training
* Delete files from `model` directory if you want to train from scratch
* Go to the `src` directory and execute `python main.py --mode train --data_dir path/to/IAM`
...
...
the model is trained on text line images created by combining multiple word images into one
* Training stops after a fixed number of epochs without improvement
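The early-stopping rule can be sketched as follows; `train_epoch`, `validate` and the patience value are stand-ins for illustration, not the repository's actual functions:

```python
def train_until_no_improvement(train_epoch, validate, patience=25):
    """Train epoch by epoch; stop once the validation character error
    rate (CER) has not improved for `patience` consecutive epochs."""
    best_cer = float("inf")
    bad_epochs = 0
    while bad_epochs < patience:
        train_epoch()
        cer = validate()
        if cer < best_cer:
            best_cer, bad_epochs = cer, 0  # improvement: reset the counter
        else:
            bad_epochs += 1
    return best_cer
```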
The pretrained word model was trained with this command on a GTX 1050 Ti: ...
### Fast image loading
Loading and decoding the PNG image files from disk is the bottleneck, even when using only a small GPU.
The LMDB database is used to speed up image loading:
* Go to the `src` directory and run `create_lmdb.py --data_dir path/to/iam` with the IAM data directory specified
* A subfolder `lmdb` is created in the IAM data directory containing the LMDB files
* When training the model, add the command line option `--fast`
The dataset should be located on an SSD drive.
Using the `--fast` option and a GTX 1050 Ti, training on single words takes around 3h with a batch size of 500.
Training on text lines takes a bit longer.
## Information about model
The model is a stripped-down version of the HTR system I implemented for [my thesis](https://repositum.tuwien.ac.at/obvutwhs/download/pdf/2874742).
What remains is the bare minimum to recognize text with an acceptable accuracy.
It consists of 5 CNN layers, 2 RNN (LSTM) layers and the CTC loss and decoding layer.
The illustration below gives an overview of the NN (green: operations, pink: data flowing through the NN); a short description follows:
* The input image is a gray-value image and has a size of 128x32
(in training mode the width is fixed, while in inference mode there is no restriction other than being a multiple of 4)
* 5 CNN layers map the input image to a feature sequence of size 32x256
* 2 LSTM layers with 256 units propagate information through the sequence and map the sequence to a matrix of size 32x80. Each matrix-element represents a score for one of the 80 characters at one of the 32 time-steps
* The CTC layer either calculates the loss value given the matrix and the ground-truth text (when training), or it decodes the matrix to the final text with best path decoding or beam search decoding (when inferring)
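The multiple-of-4 width constraint at inference time can be met by padding the line image on the right. This is a minimal sketch with a hypothetical helper, not part of this repository, and it assumes a white (255) background:

```python
import numpy as np

def pad_width_to_multiple(img, multiple=4, pad_value=255):
    """Pad a gray-value image (H x W) on the right so that its width
    becomes the next multiple of `multiple`."""
    h, w = img.shape
    target_w = -(-w // multiple) * multiple  # ceiling division
    padded = np.full((h, target_w), pad_value, dtype=img.dtype)
    padded[:, :w] = img
    return padded

print(pad_width_to_multiple(np.zeros((32, 130), dtype=np.uint8)).shape)  # -> (32, 132)
```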

For more details see this [Medium article](https://towardsdatascience.com/2326a3487cd5).