This repository was archived by the owner on Apr 27, 2026. It is now read-only.

Implementation of resizing codec #277

Open

ChWick wants to merge 1 commit into ocropus-archive:master from ChWick:codec_resize

Conversation


@ChWick ChWick commented Dec 13, 2017

If a pretrained model is used whose codec differs from the target text (e.g. historical documents), one has to adapt the codec to match the desired characters.

This pull request makes it possible to automatically extend or shrink the codec based on the provided ground-truth data after loading a pretrained model. This is done by changing the dimension of the output LSTM layer (before the softmax), whereby the old trained values are kept. Obviously, to learn the new characters the model must be retrained on the new data.
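The resizing step described above can be sketched with a plain NumPy matrix standing in for the output layer's weights (a hypothetical illustration with made-up names, not the actual ocrolib code): rows for deleted codec positions are dropped, freshly initialized rows are appended for the new characters, and all surviving trained rows are carried over unchanged.

```python
import numpy as np

def resize_output_weights(W, deleted_positions, n_new, scale=0.01, rng=None):
    """Resize an output weight matrix (one row per codec entry).

    Rows listed in deleted_positions are removed, n_new small random
    rows are appended for the new characters, and every surviving
    trained row is kept unchanged.
    """
    rng = rng or np.random.default_rng(0)
    deleted = set(deleted_positions)
    # keep the trained rows for characters that stay in the codec
    kept = W[[i for i in range(W.shape[0]) if i not in deleted]]
    # small random initialization for the new characters
    fresh = rng.normal(0.0, scale, size=(n_new, W.shape[1]))
    return np.vstack([kept, fresh])
```

For instance, a 5-class output layer that loses one character and gains two ends up as a 6-row matrix whose first four rows are the old trained weights; the new rows only become meaningful after retraining.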

Contributor

amitdo commented Dec 13, 2017

I like the feature.

Did you test it? How does it compare to training from scratch?

Related:
tmbdev/clstm#106

Author

ChWick commented Dec 13, 2017

Our paper based on this technique applied to historical documents will hopefully be published this month. I will reference it as soon as it is available. Our findings and the improvements compared to training from scratch are documented there.

Contributor

amitdo commented Dec 13, 2017

Sounds promising :-)

Comment thread ocrolib/lstm.py
     for w,dw,n in self.net.weights():
         yield w,dw,"Reversed/%s"%n
+    def resizeoutput(self, no, deleted_positions):
+        self.net.resizeOutput(no, deleted_positions)
Contributor

@amitdo amitdo Dec 14, 2017


Python is case sensitive. 'resizeOutput' != 'resizeoutput'

Use the standard Python style for naming functions/methods
https://www.python.org/dev/peps/pep-0008/#method-names-and-instance-variables

Author

ChWick commented Dec 15, 2017

The function is renamed. If you prefer resize_output to resizeOutput, let me know.
In the new commit I added support for a FloatingPointError exception during training; the codec will be resized in this case.
The paper should be available on arXiv on Monday.

Contributor

amitdo commented Dec 15, 2017

Are you talking about this paper?
https://arxiv.org/abs/1711.09670

I added it to the 'Publications' wiki page:
https://github.com/tmbdev/ocropy/wiki/Publications

One of the author names matches your user name...

Author

ChWick commented Dec 15, 2017

No, this is another paper, but it does not use the resizing of the codec. The new paper was submitted to arXiv today and should therefore be available on Monday.

Contributor

amitdo commented Dec 15, 2017

I saw your deleted comment that says 'You already found it'...
:-)


chreul commented Dec 18, 2017

The corresponding paper is now available at arXiv: https://arxiv.org/abs/1712.05586

Contributor

amitdo commented Dec 18, 2017

Thanks for sharing your research and code.

Related, Training Tesseract 4.00:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

  • Fine Tuning for Impact (new font style)
  • Fine Tuning for ± a few characters
  • Training Just a Few Layers

The second option is similar to what your patch does.
Ocropy does not have the third option.

Contributor

amitdo commented Dec 19, 2017

@chreul, @ChWick

Figure 1. Different example lines from the seven books used.
From top to bottom: excerpts from books 1476, 1488, 1495, 1500, 1505, 1509, and 1572.

For example, book 1505 shows the least improvement over the default approach (but still 23% and 8%, respectively). Most likely this is caused by the fact that the distances between two characters in book 1505 are considerably smaller compared to all other books used for training and testing (see Figure 1, line 4).

Update: Fixed in version 2 (v2) of the paper.

Contributor

amitdo commented Dec 19, 2017

The location of the description of this patch in the paper:

3.3 Utilizing Arbitrary Pretrained Models in OCRopus

3.3.1 Extending the Codec

3.3.2 Reducing the Codec

@mittagessen

Fine Tuning for Impact (new font style)
Fine Tuning for ± a few characters
Training Just a Few Layers

The second option is similar to what your patch does.

Technically the second and third options are equivalent: in both cases the final linear projection is sliced off and a new one is trained, although the weights are already somewhat meaningful when only a few rows are deleted. It would be possible to add a complete weight reinitialization here, although I'm unsure whether the single LSTM layer learns representations well enough to be worth the effort.
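The two variants discussed here can be contrasted in a small sketch (hypothetical names, a NumPy matrix standing in for the network's final projection): keeping the rows shared between the old and new codec, versus reinitializing the whole projection and training it from scratch.

```python
import numpy as np

def new_projection(W_old, old_codec, new_codec, keep_overlap=True,
                   scale=0.01, rng=None):
    """Build the final projection for a new codec.

    keep_overlap=True copies the trained row of every character that
    occurs in both codecs; keep_overlap=False reinitializes the whole
    projection, i.e. the new output layer is trained from scratch.
    """
    rng = rng or np.random.default_rng(0)
    W_new = rng.normal(0.0, scale, size=(len(new_codec), W_old.shape[1]))
    if keep_overlap:
        old_index = {c: i for i, c in enumerate(old_codec)}
        for j, c in enumerate(new_codec):
            if c in old_index:
                # carry over the trained weights for shared characters
                W_new[j] = W_old[old_index[c]]
    return W_new
```

With `keep_overlap=True` only the rows for genuinely new characters start from random values, which is why the retained weights are "already somewhat meaningful" when the codecs mostly overlap.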

@zuphilip
Collaborator

@ChWick Thank you, this looks very interesting! I have seen that your paper appeared in the journal 027.7: http://0277.ch/ojs/index.php/cdrs_0277/article/view/169. It will take some time to check the details and test it thoroughly...

Comment thread ocrolib/lstm.py Outdated

 def normalize_nfkc(s):
-    return unicodedata.normalize('NFKC',s)
+    return unicodedata.normalize('NFC',s)
Collaborator


Why the change? It's confusing, since the method is called normalize_nfkc.

Collaborator

@kba kba left a comment


LGTM, thank you.

Let's merge this once it's clear whether the NFC/NFKC change was deliberate.

Also, a minimal test for CI would be helpful: train a minimal model, extend & shrink the character set, and make sure it doesn't break. Maybe you have such sample data from developing this?

Author

ChWick commented Feb 20, 2018

The NFC/NFKC change was needed for our purposes but apparently not for this pull request. The change has been undone and my branch rebased onto the current master.

As a test I propose two single text lines with different alphabets. Use the --codec argument to generate the appropriate codec for the initial model and for the second model that loads the first one.
E.g.:

  1. ocropus-rtrain --codec text_line_gt.txt --ntrain 2 -F 1 --output tmp test_file.png
  2. ocropus-rtrain --codec 2nd_text_line_gt.txt --ntrain 2 --load tmp-00000001.pyrnn.gz 2nd_test_file.png

This must also work (default codec in the initial model):

  1. ocropus-rtrain --ntrain 2 -F 1 --output tmp test_file.png
  2. ocropus-rtrain --codec 2nd_text_line_gt.txt --ntrain 2 --load tmp-00000001.pyrnn.gz 2nd_test_file.png

The content of test_line_gt.txt would be e.g. ABCDEFG, and that of 2nd_test_line_gt.txt EFGHIJKL.
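For that pair of ground-truth alphabets, the resizing step would see the following changes; a small helper (hypothetical, mirroring the deleted_positions argument from the patch) makes the expected behaviour explicit:

```python
def codec_changes(old_chars, new_chars):
    """Return (deleted_positions, added_chars): positions of old codec
    entries absent from the new ground truth, and characters that need
    freshly initialized output rows."""
    old_set, new_set = set(old_chars), set(new_chars)
    deleted = [i for i, c in enumerate(old_chars) if c not in new_set]
    added = [c for c in new_chars if c not in old_set]
    return deleted, added
```

Going from ABCDEFG to EFGHIJKL, positions 0-3 (A-D) are deleted and H-L get new output rows, while the trained rows for the shared characters E, F, and G are kept — which is exactly what the test should verify doesn't break.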
