Probate Parsing Solution by shahsaumya · Pull Request #8 · FreeUKGen/SummerOfCodeImages

shahsaumya · 2018-04-04T20:29:12Z

Refers to issue #7.

I propose an end-to-end system that extracts the text from probate books and seeds a database with the entities found in it, such as name, county, date and relationships. The system breaks down into three phases:

Text extraction using Optical Character Recognition
Named Entity Recognition using Language Processing
Database Seeding based on the entities generated
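
For the third phase, a minimal sketch of how extracted entities could be seeded into a database. It assumes SQLite through Python's standard `sqlite3` module; the table name `probate_entries`, its columns and the `seed_entries` helper are hypothetical illustrations built from the entities named above, not code from this pull request.

```python
import sqlite3

# Hypothetical schema: the columns mirror the entities mentioned in the
# description (name, county, date, relationship), not the PR's actual code.
SCHEMA = """
CREATE TABLE IF NOT EXISTS probate_entries (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    county TEXT,
    date TEXT,
    relationship TEXT
)
"""


def seed_entries(conn, entries):
    """Insert entity dicts into probate_entries and return the row count."""
    conn.execute(SCHEMA)
    rows = [
        (e["name"], e.get("county"), e.get("date"), e.get("relationship"))
        for e in entries
    ]
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO probate_entries (name, county, date, relationship) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)
```

Entities the recognizer did not find are stored as `NULL`, so a partially parsed entry can still be seeded and corrected later.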

Because there are too few samples to train a named entity recognizer, I used NLTK and the Stanford NER wrapper to produce the results.
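
To illustrate the recognition step, here is a rough sketch that tags text with NLTK's built-in chunker and merges consecutive tokens with the same tag into one entity. The helper names `group_entities` and `tag_with_nltk` are hypothetical, and the sketch assumes the tagger output is a list of `(token, tag)` pairs with `"O"` marking non-entity tokens; it is not the code in `nltk_ner.py` or `stanford_ner.py`.

```python
from itertools import groupby


def group_entities(tagged_tokens, outside_tag="O"):
    """Merge consecutive tokens that share a non-outside tag into (text, tag)."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside_tag:
            entities.append((" ".join(token for token, _ in group), tag))
    return entities


def tag_with_nltk(text):
    """Return (token, tag) pairs using NLTK's pretrained chunker.

    Needs nltk installed along with its tokenizer, POS tagger and
    named entity chunker data packages.
    """
    import nltk  # imported lazily so group_entities works without nltk

    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    pairs = []
    for node in tree:
        if isinstance(node, nltk.Tree):
            # A subtree is a recognized entity; its label is the entity type.
            pairs.extend((token, node.label()) for token, _pos in node.leaves())
        else:
            pairs.append((node[0], "O"))  # plain (word, pos) tuple
    return pairs
```

One limitation of this grouping: two distinct entities of the same type that sit directly next to each other are merged into one.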

1. OCR using pytesseract - ocr.py
2. Named Entity Recognition using NLTK and Stanford NLP Wrapper
   a) NLTK - nltk_ner.py
   b) Stanford NLP - stanford_ner.py
3. To get a good idea of prerequisites and execution details - README.md
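
As an illustration of the OCR step, a minimal sketch using `pytesseract.image_to_string` with a small cleanup pass. The function names `ocr_page` and `clean_ocr_text` are hypothetical and the cleanup rules are an assumption about what scanned pages might need; see `ocr.py` for what the pull request actually does.

```python
import re


def clean_ocr_text(raw):
    """Rejoin words hyphenated across line breaks and collapse whitespace."""
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", joined).strip()


def ocr_page(image_path, lang="eng"):
    """Run Tesseract on one scanned page and return cleaned text.

    Needs the Tesseract binary plus the pytesseract and Pillow packages.
    """
    import pytesseract  # imported lazily so clean_ocr_text has no dependencies
    from PIL import Image

    with Image.open(image_path) as image:
        return clean_ocr_text(pytesseract.image_to_string(image, lang=lang))
```

The hyphen rule also removes the hyphen from a genuine compound word that happens to break at the end of a line, so it trades a few such errors for cleaner tokens in the recognition step.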

shahsaumya added 2 commits April 5, 2018 01:42

Probate Parsing Solution

682099f

Fixes in README

bd5a0ee

shahsaumya wants to merge 2 commits into FreeUKGen:master from shahsaumya:master
