This repository was archived by the owner on Apr 27, 2026. It is now read-only.

Implementation of resizing codec #277

Open

ChWick wants to merge 1 commit into ocropus-archive:master from ChWick:codec_resize

Conversation


@ChWick ChWick commented Dec 13, 2017

If a pretrained model is used whose codec differs from the target text (e.g. historical documents), one has to adapt the codec to match the desired characters.

This pull request makes it possible to automatically extend or shrink the codec based on the provided ground-truth data after loading a pretrained model. This is done by changing the dimension of the output LSTM layer (before the softmax), whereby the old trained values are kept. Obviously, to learn the new characters the model must be retrained on the new data.
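The resizing step described above can be sketched with a plain NumPy matrix standing in for the output layer's weights (a hypothetical illustration with made-up names, not the actual ocrolib code): rows for deleted codec positions are dropped, freshly initialized rows are appended for the new characters, and all surviving trained rows are carried over unchanged.

```python
import numpy as np

def resize_output_weights(W, deleted_positions, n_new, scale=0.01, rng=None):
    """Resize an output weight matrix (one row per codec entry).

    Rows listed in deleted_positions are removed, n_new small random
    rows are appended for the new characters, and every surviving
    trained row is kept unchanged.
    """
    rng = rng or np.random.default_rng(0)
    deleted = set(deleted_positions)
    # keep the trained rows for characters that stay in the codec
    kept = W[[i for i in range(W.shape[0]) if i not in deleted]]
    # small random initialization for the new characters
    fresh = rng.normal(0.0, scale, size=(n_new, W.shape[1]))
    return np.vstack([kept, fresh])
```

For instance, a 5-class output layer that loses one character and gains two ends up as a 6-row matrix whose first four rows are the old trained weights; the new rows only become meaningful after retraining.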

Contributor

amitdo commented Dec 13, 2017

I like the feature.

Did you test it? How does it compare to training from scratch?

Related:
tmbdev/clstm#106

Author

ChWick commented Dec 13, 2017

Our paper based on this technique applied to historical documents will hopefully be published this month. I will reference it as soon as it is available. Our findings and the improvements compared to training from scratch are documented there.

Contributor

amitdo commented Dec 13, 2017

Sounds promising :-)

Comment thread ocrolib/lstm.py
     for w,dw,n in self.net.weights():
         yield w,dw,"Reversed/%s"%n
+    def resizeoutput(self, no, deleted_positions):
+        self.net.resizeOutput(no, deleted_positions)
Contributor

@amitdo amitdo Dec 14, 2017


Python is case sensitive. 'resizeOutput' != 'resizeoutput'

Use the standard Python style for naming functions/methods
https://www.python.org/dev/peps/pep-0008/#method-names-and-instance-variables

Author

ChWick commented Dec 15, 2017

The function is renamed. If you prefer resize_output to resizeOutput, let me know.
In the new commit I added support for a FloatingPointError exception during training; the codec will be resized in this case.
The paper should be available on arXiv on Monday.

Contributor

amitdo commented Dec 15, 2017

Are you talking about this paper?
https://arxiv.org/abs/1711.09670

I added it to the 'Publications' wiki page:
https://github.com/tmbdev/ocropy/wiki/Publications

One of the author names matches your user name...

Author

ChWick commented Dec 15, 2017

No, this is another paper, but it does not use the resizing of the codec. The new paper was submitted to arXiv today and should therefore be available on Monday.

Contributor

amitdo commented Dec 15, 2017

I saw your deleted comment that says 'You already found it'...
:-)


chreul commented Dec 18, 2017

The corresponding paper is now available at arXiv: https://arxiv.org/abs/1712.05586

Contributor

amitdo commented Dec 18, 2017

Thanks for sharing your research and code.

Related, Training Tesseract 4.00:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

  • Fine Tuning for Impact (new font style)
  • Fine Tuning for ± a few characters
  • Training Just a Few Layers

The second option is similar to what your patch does.
Ocropy does not have the third option.

Contributor

amitdo commented Dec 19, 2017

@chreul, @ChWick

Figure 1. Different example lines from the seven books used.
From top to bottom: excerpts from books 1476, 1488, 1495, 1500, 1505, 1509, and 1572.

For example, book 1505 shows the least improvement over the default approach (but still 23% and 8%, respectively). Most likely this is caused by the fact that the distances between two characters in book 1505 are considerably smaller compared to all other books used for training and testing (see Figure 1, line 4).

Update: Fixed in version 2 (v2) of the paper.

Contributor

amitdo commented Dec 19, 2017

The location of the description of this patch in the paper:

3.3 Utilizing Arbitrary Pretrained Models in OCRopus

3.3.1 Extending the Codec

3.3.2 Reducing the Codec

@mittagessen

Fine Tuning for Impact (new font style)
Fine Tuning for ± a few characters
Training Just a Few Layers

The second option is similar to what your patch does.

Technically the second and third options are equivalent: in both cases the final linear projection is sliced off and a new one is trained, although the weights are already somewhat meaningful when only a few rows are deleted. It would be possible to add a complete weight reinitialization here, although I'm unsure whether the single LSTM layer learns representations well enough to be worth the effort.
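The two variants discussed here can be contrasted in a small sketch (hypothetical names, a NumPy matrix standing in for the network's final projection): keeping the rows shared between the old and new codec, versus reinitializing the whole projection and training it from scratch.

```python
import numpy as np

def new_projection(W_old, old_codec, new_codec, keep_overlap=True,
                   scale=0.01, rng=None):
    """Build the final projection for a new codec.

    keep_overlap=True copies the trained row of every character that
    occurs in both codecs; keep_overlap=False reinitializes the whole
    projection, i.e. the new output layer is trained from scratch.
    """
    rng = rng or np.random.default_rng(0)
    W_new = rng.normal(0.0, scale, size=(len(new_codec), W_old.shape[1]))
    if keep_overlap:
        old_index = {c: i for i, c in enumerate(old_codec)}
        for j, c in enumerate(new_codec):
            if c in old_index:
                # carry over the trained weights for shared characters
                W_new[j] = W_old[old_index[c]]
    return W_new
```

With `keep_overlap=True` only the rows for genuinely new characters start from random values, which is why the retained weights are "already somewhat meaningful" when the codecs mostly overlap.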

@zuphilip
Collaborator

@ChWick Thank you, this looks very interesting! I have seen that your paper appeared in the journal 027.7: http://0277.ch/ojs/index.php/cdrs_0277/article/view/169. It will take some time to check the details and test it thoroughly...

Comment thread ocrolib/lstm.py Outdated

 def normalize_nfkc(s):
-    return unicodedata.normalize('NFKC',s)
+    return unicodedata.normalize('NFC',s)
Collaborator


Why the change? It's confusing, since the method is called normalize_nfkc.

Collaborator

@kba kba left a comment


LGTM, thank you.

Let's merge this once it's clear whether the NFC/NFKC change was deliberate.

Also, a minimal test for CI would be helpful: train a minimal model, extend & shrink the character set, and make sure it doesn't break. Maybe you have such sample data from developing this?

Author

ChWick commented Feb 20, 2018

The NFC/NFKC change was needed for our purposes but apparently not for this pull request. The change has been undone and my branch rebased onto the current master.

As a test I propose two single text lines with different alphabets. Use the --codec argument to generate the appropriate codec for the initial model and for the second model that loads the first one.
E.g.:

  1. ocropus-rtrain --codec text_line_gt.txt --ntrain 2 -F 1 --output tmp test_file.png
  2. ocropus-rtrain --codec 2nd_text_line_gt.txt --ntrain 2 --load tmp-00000001.pyrnn.gz 2nd_test_file.png

This must also work (default codec in the initial model):

  1. ocropus-rtrain --ntrain 2 -F 1 --output tmp test_file.png
  2. ocropus-rtrain --codec 2nd_text_line_gt.txt --ntrain 2 --load tmp-00000001.pyrnn.gz 2nd_test_file.png

The content of test_line_gt.txt would be e.g. ABCDEFG, and that of 2nd_test_line_gt.txt EFGHIJKL.
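For that pair of ground-truth alphabets, the resizing step would see the following changes; a small helper (hypothetical, mirroring the deleted_positions argument from the patch) makes the expected behaviour explicit:

```python
def codec_changes(old_chars, new_chars):
    """Return (deleted_positions, added_chars): positions of old codec
    entries absent from the new ground truth, and characters that need
    freshly initialized output rows."""
    old_set, new_set = set(old_chars), set(new_chars)
    deleted = [i for i, c in enumerate(old_chars) if c not in new_set]
    added = [c for c in new_chars if c not in old_set]
    return deleted, added
```

Going from ABCDEFG to EFGHIJKL, positions 0-3 (A-D) are deleted and H-L get new output rows, while the trained rows for the shared characters E, F, and G are kept — which is exactly what the test should verify doesn't break.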
