diff --git a/README.md b/README.md
index 94f47eac8798b9bf2926c8835890763a5b3136ad..3d8db0a9da10ede7d8ef76f2ad3849eae5be941d 100644
--- a/README.md
+++ b/README.md
@@ -41,7 +41,7 @@ If neither `--train` nor `--validate` is specified, the NN infers the text from
 
 ## Integrate word beam search decoding
 
-It is possible to use the word beam search decoder \[4\] instead of the two decoders shipped with TF.
+It is possible to use the [word beam search decoder](https://repositum.tuwien.ac.at/obvutwoa/download/pdf/2774578) instead of the two decoders shipped with TF.
 Words are constrained to those contained in a dictionary, but arbitrary non-word character strings (numbers, punctuation marks) can still be recognized.
 The following illustration shows a sample for which word beam search is able to recognize the correct text, while the other decoders fail.
 
@@ -61,7 +61,7 @@ Beam width is set to 50 to conform with the beam width of vanilla beam search de
 
 ## Train model with IAM dataset
 
-Follow these instructions to get the IAM dataset \[5\]:
+Follow these instructions to get the IAM dataset:
 
 * Register for free at this [website](http://www.fki.inf.unibe.ch/databases/iam-handwriting-database)
 * Download `words/words.tgz`
@@ -88,7 +88,7 @@ Using the `--fast` option and a GTX 1050 Ti training takes around 3h with a batc
 ## Information about model
 
 ### Overview
-The model \[1\] is a stripped-down version of the HTR system I implemented for my thesis \[2\]\[3\].
+The model is a stripped-down version of the HTR system I implemented for [my thesis](https://repositum.tuwien.ac.at/obvutwhs/download/pdf/2874742).
 What remains is what I think is the bare minimum to recognize text with an acceptable accuracy.
 It consists of 5 CNN layers, 2 RNN (LSTM) layers and the CTC loss and decoding layer.
 The illustration below gives an overview of the NN (green: operations, pink: data flowing through NN) and here follows a short description:
@@ -102,33 +102,15 @@ The illustration below gives an overview of the NN (green: operations, pink: dat
 
-### Analyze model
-Run `python analyze.py` with the following arguments to analyze the image file `data/analyze.png` with the ground-truth text "are":
-
-* `--relevance`: compute the pixel relevance for the correct prediction
-* `--invariance`: check if the model is invariant to horizontal translations of the text
-* No argument provided: show the results
-
-Results are shown in the plots below.
-For more information see [this article](https://towardsdatascience.com/6c04864b8a98).
-
-![analyze](./doc/analyze.png)
-
 ## FAQ
 
 * I get the error message "... TFWordBeamSearch.so: cannot open shared object file: No such file or directory": if you want to use word beam search decoding, you have to compile the custom TF operation from source
 * Where can I find the file `words.txt` of the IAM dataset: it is located in the subfolder `ascii` on the IAM website
-* I want to recognize text of line (or sentence) images: this is not possible with the provided model. The size of the input image is too small. For more information read [this article](https://medium.com/@harald_scheidl/27648fb18519) or have a look at the [lamhoangtung/LineHTR](https://github.com/lamhoangtung/LineHTR) repository
+* I want to recognize the text contained in a text-line: the model is too small for this; first segment the line into words, e.g. using the model from the [WordDetectorNN](https://github.com/githubharald/WordDetectorNN) repository
 * I get an error when running the script more than once from an interactive Python session: do **not** call function `main()` in file `main.py` from an interactive session, as the TF computation graph is created multiple times when calling `main()` multiple times. Run the script by executing `python main.py` instead
 
 ## References
 
-\[1\] [Build a Handwritten Text Recognition System using TensorFlow](https://towardsdatascience.com/2326a3487cd5)
-
-\[2\] [Scheidl - Handwritten Text Recognition in Historical Documents](https://repositum.tuwien.ac.at/obvutwhs/download/pdf/2874742)
-
-\[3\] [Shi - An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition](https://arxiv.org/pdf/1507.05717.pdf)
-
-\[4\] [Scheidl - Word Beam Search: A Connectionist Temporal Classification Decoding Algorithm](https://repositum.tuwien.ac.at/obvutwoa/download/pdf/2774578)
+* [Build a Handwritten Text Recognition System using TensorFlow](https://towardsdatascience.com/2326a3487cd5)
+* [Scheidl - Handwritten Text Recognition in Historical Documents](https://repositum.tuwien.ac.at/obvutwhs/download/pdf/2874742)
+* [Scheidl - Word Beam Search: A Connectionist Temporal Classification Decoding Algorithm](https://repositum.tuwien.ac.at/obvutwoa/download/pdf/2774578)
 
-\[5\] [Marti - The IAM-database: an English sentence database for offline handwriting recognition](http://www.fki.inf.unibe.ch/databases/iam-handwriting-database)
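For orientation, here is a minimal sketch (assuming TF 2.x in eager mode) of how the two decoders shipped with TF, best path (greedy) decoding and vanilla beam search, might be invoked on CTC output; the tensor shapes and the size of the character set are illustrative assumptions, and the random logits merely stand in for the real RNN output:

```python
import numpy as np
import tensorflow as tf

max_time, batch_size, num_classes = 32, 1, 80  # assumed: 79 characters + 1 CTC blank

# CTC decoders expect time-major logits of shape [max_time, batch_size, num_classes];
# random values stand in for the RNN output here
logits = tf.constant(np.random.randn(max_time, batch_size, num_classes), dtype=tf.float32)
seq_len = tf.fill([batch_size], max_time)

# decoder 1: best path (greedy) decoding
greedy, _ = tf.nn.ctc_greedy_decoder(logits, seq_len)

# decoder 2: vanilla beam search; beam width 50 matches the setting mentioned above
beam, _ = tf.nn.ctc_beam_search_decoder(logits, seq_len, beam_width=50)

# both decoders return a list of sparse tensors holding character indices
print(tf.sparse.to_dense(greedy[0]).numpy())
print(tf.sparse.to_dense(beam[0]).numpy())
```

Word beam search takes the place of the second call: it is a custom TF operation that constrains the words of the beams to a dictionary while still allowing arbitrary non-word character strings between them.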
diff --git a/data/analyze.png b/data/analyze.png
deleted file mode 100644
index ae1c88b34174e82736122fdd824fd6cf06e4dd11..0000000000000000000000000000000000000000
Binary files a/data/analyze.png and /dev/null differ
diff --git a/doc/analyze.png b/doc/analyze.png
deleted file mode 100644
index edf28ae5ef29fdc921cfe71422272a762724179f..0000000000000000000000000000000000000000
Binary files a/doc/analyze.png and /dev/null differ
diff --git a/src/analyze.py b/src/analyze.py
deleted file mode 100644
index 40f1e583958b231e63f030a5327469ccb71d72cc..0000000000000000000000000000000000000000
--- a/src/analyze.py
+++ /dev/null
@@ -1,156 +0,0 @@
-import copy
-import math
-import pickle
-import sys
-
-import cv2
-import matplotlib.pyplot as plt
-import numpy as np
-
-from DataLoaderIAM import Batch
-from Model import Model, DecoderType
-from SamplePreprocessor import preprocess
-
-
-# constants like filepaths
-class Constants:
-    "filenames and paths to data"
-    fnCharList = '../model/charList.txt'
-    fnAnalyze = '../data/analyze.png'
-    fnPixelRelevance = '../data/pixelRelevance.npy'
-    fnTranslationInvariance = '../data/translationInvariance.npy'
-    fnTranslationInvarianceTexts = '../data/translationInvarianceTexts.pickle'
-    gtText = 'are'
-    distribution = 'histogram'  # 'histogram' or 'uniform'
-
-
-def odds(val):
-    return val / (1 - val)
-
-
-def weightOfEvidence(origProb, margProb):
-    return math.log2(odds(origProb)) - math.log2(odds(margProb))
-
-
-def analyzePixelRelevance():
-    "simplified implementation of paper: Zintgraf et al - Visualizing Deep Neural Network Decisions: Prediction Difference Analysis"
-
-    # setup model
-    model = Model(open(Constants.fnCharList).read(), DecoderType.BestPath, mustRestore=True)
-
-    # read image and specify ground-truth text
-    img = cv2.imread(Constants.fnAnalyze, cv2.IMREAD_GRAYSCALE)
-    (w, h) = img.shape
-    assert Model.imgSize[1] == w
-
-    # compute probability of gt text in original image
-    batch = Batch([Constants.gtText], [preprocess(img, Model.imgSize)])
-    (_, probs) = model.inferBatch(batch, calcProbability=True, probabilityOfGT=True)
-    origProb = probs[0]
-
-    grayValues = [0, 63, 127, 191, 255]
-    if Constants.distribution == 'histogram':
-        bins = [0, 31, 95, 159, 223, 255]
-        (hist, _) = np.histogram(img, bins=bins)
-        pixelProb = hist / sum(hist)
-    elif Constants.distribution == 'uniform':
-        pixelProb = [1.0 / len(grayValues) for _ in grayValues]
-    else:
-        raise Exception('unknown value for Constants.distribution')
-
-    # iterate over all pixels in image
-    pixelRelevance = np.zeros(img.shape, np.float32)
-    for x in range(w):
-        for y in range(h):
-
-            # try a subset of possible grayvalues of pixel (x,y)
-            imgsMarginalized = []
-            for g in grayValues:
-                imgChanged = copy.deepcopy(img)
-                imgChanged[x, y] = g
-                imgsMarginalized.append(preprocess(imgChanged, Model.imgSize))
-
-            # put them all into one batch
-            batch = Batch([Constants.gtText] * len(imgsMarginalized), imgsMarginalized)
-
-            # compute probabilities
-            (_, probs) = model.inferBatch(batch, calcProbability=True, probabilityOfGT=True)
-
-            # marginalize over pixel value (assume uniform distribution)
-            margProb = sum([probs[i] * pixelProb[i] for i in range(len(grayValues))])
-
-            pixelRelevance[x, y] = weightOfEvidence(origProb, margProb)
-
-            print(x, y, pixelRelevance[x, y], origProb, margProb)
-
-    np.save(Constants.fnPixelRelevance, pixelRelevance)
-
-
-def analyzeTranslationInvariance():
-    # setup model
-    model = Model(open(Constants.fnCharList).read(), DecoderType.BestPath, mustRestore=True)
-
-    # read image and specify ground-truth text
-    img = cv2.imread(Constants.fnAnalyze, cv2.IMREAD_GRAYSCALE)
-    (w, h) = img.shape
-    assert Model.imgSize[1] == w
-
-    imgList = []
-    for dy in range(Model.imgSize[0] - h + 1):
-        targetImg = np.ones((Model.imgSize[1], Model.imgSize[0])) * 255
-        targetImg[:, dy:h + dy] = img
-        imgList.append(preprocess(targetImg, Model.imgSize))
-
-    # put images and gt texts into batch
-    batch = Batch([Constants.gtText] * len(imgList), imgList)
-
-    # compute probabilities
-    (texts, probs) = model.inferBatch(batch, calcProbability=True, probabilityOfGT=True)
-
-    # save results to file
-    f = open(Constants.fnTranslationInvarianceTexts, 'wb')
-    pickle.dump(texts, f)
-    f.close()
-    np.save(Constants.fnTranslationInvariance, probs)
-
-
-def showResults():
-    # 1. pixel relevance
-    pixelRelevance = np.load(Constants.fnPixelRelevance)
-    plt.figure('Pixel relevance')
-
-    plt.imshow(pixelRelevance, cmap=plt.cm.jet, vmin=-0.25, vmax=0.25)
-    plt.colorbar()
-
-    img = cv2.imread(Constants.fnAnalyze, cv2.IMREAD_GRAYSCALE)
-    plt.imshow(img, cmap=plt.cm.gray, alpha=.4)
-
-    # 2. translation invariance
-    probs = np.load(Constants.fnTranslationInvariance)
-    f = open(Constants.fnTranslationInvarianceTexts, 'rb')
-    texts = pickle.load(f)
-    texts = ['%d:' % i + texts[i] for i in range(len(texts))]
-    f.close()
-
-    plt.figure('Translation invariance')
-
-    plt.plot(probs, 'o-')
-    plt.xticks(np.arange(len(texts)), texts, rotation='vertical')
-    plt.xlabel('horizontal translation and best path')
-    plt.ylabel('text probability of "%s"' % Constants.gtText)
-
-    # show both plots
-    plt.show()
-
-
-if __name__ == '__main__':
-    if len(sys.argv) > 1:
-        if sys.argv[1] == '--relevance':
-            print('Analyze pixel relevance')
-            analyzePixelRelevance()
-        elif sys.argv[1] == '--invariance':
-            print('Analyze translation invariance')
-            analyzeTranslationInvariance()
-    else:
-        print('Show results')
-        showResults()
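To make the model overview above more concrete, here is a rough Keras sketch of such a 5-CNN/2-LSTM/CTC pipeline; the 128x32 input size, the filter counts and pooling sizes, and the output of 32 time steps over 80 character classes are illustrative assumptions, not the exact hyperparameters of this repository:

```python
import tensorflow as tf
from tensorflow.keras import layers

num_classes = 80  # assumed size of the character set + 1 for the CTC blank

inputs = tf.keras.Input(shape=(128, 32, 1))  # gray-value word image (assumed size)
x = inputs

# 5 CNN layers: each convolution extracts features, each pooling step shrinks the image
for filters, pool in [(32, (2, 2)), (64, (2, 2)), (128, (1, 2)), (128, (1, 2)), (256, (1, 2))]:
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    x = layers.MaxPooling2D(pool)(x)

# the CNN output is a sequence of 32 time steps with 256 features each
x = layers.Reshape((32, 256))(x)

# 2 RNN (LSTM) layers propagate information through the sequence
x = layers.LSTM(256, return_sequences=True)(x)
x = layers.LSTM(256, return_sequences=True)(x)

# per-time-step character scores (logits); during training they would feed the
# CTC loss, for recognition a CTC decoder such as the ones sketched earlier
outputs = layers.Dense(num_classes)(x)

model = tf.keras.Model(inputs, outputs)
model.summary()
```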