#RObservations #24: Using Tesseract-OCR to Scan Bank Documents and Extract Relevant Data (2024)


I have clearly been out of the loop, because I have only recently learned about the tesseract library in R. If I had known about it earlier, I would have written about it much sooner!

The tesseract package provides R bindings to the Tesseract-OCR engine, a powerful optical character recognition (OCR) engine that supports over 100 languages and lets users scan and extract text from pictures. This has direct applications in any field that deals with large amounts of manual data processing, such as accounting, mortgages/real estate, insurance and archival work. In this blog I am going to explore how it's possible to parse a JPMorgan Chase bank statement using the tesseract, pdftools, stringr, and tidyverse libraries.

Disclaimer: The bank statement I am using is from a sample template and is not personal information. The statement sample can be accessed here; the document being parsed is shown below.

From what I’ve seen in the CRAN documentation we need to first convert the .pdf file into .png files. This can be done with the pdf_convert() function available in the pdftools package. To ensure that text will be read accurately, setting the dpi argument to a large number is recommended.

bank_statement <- pdftools::pdf_convert("sample-bank-statement.pdf", dpi = 1000)

## Converting page 1 to sample-bank-statement_1.png... done!
## Converting page 2 to sample-bank-statement_2.png... done!
## Converting page 3 to sample-bank-statement_3.png... done!
## Converting page 4 to sample-bank-statement_4.png... done!

Now that the bank statement has been converted to .png files, it can be read with tesseract. At this point the data is very unstructured and needs to be parsed. I'll spare you the output here, but you can see it for yourself by printing the raw_text vector on your machine.

library(tesseract)

raw_text <- ocr(bank_statement, engine = tesseract("eng"))

From a bank statement, businesses are interested in data from the fields listed in the document, namely:

Deposits and additions,
Checks paid,
Other withdrawals, fees and charges; and
Daily ending balances.

This bank statement is relatively small and consists of only 4 pages. To create a general method for extracting data from JPMorgan Chase bank statements, which could be larger, the scanned text needs to be combined into a single string and then parsed accordingly.

# Bind raw text together
raw_text_combined <- paste(raw_text, collapse = "")
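As a small illustration of why the pages are combined before parsing, here is a sketch in base R. The two page strings are made up for the example (they are not taken from the sample statement), and it assumes each page's text ends with a newline. A table that starts on one page and ends on the next can only be matched once the pages are joined.

```r
# Made-up OCR output for two pages (not taken from the sample statement).
# ocr() returns one character string per page image.
pages <- c(
  "CHECKS PAID\n07/01 100.00\n",
  "07/02 250.00\nTotal Checks Paid 350.00\n"
)

# Join the pages into a single string.
combined <- paste(pages, collapse = "")

# A section is delimited by its title and its total line.
section_pattern <- "CHECKS PAID.*Total Checks Paid"

# Replace newlines first so the pattern can span several lines.
found_per_page <- any(grepl(section_pattern, gsub("\n", ";", pages)))
found_combined <- grepl(section_pattern, gsub("\n", ";", combined))

found_per_page  # FALSE: neither page holds the whole section
found_combined  # TRUE: the combined text does
```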

To get the data for the desired fields, we can use the field titles as anchors for parsing. With this idea and the help of regex101.com, the cleaning script below was constructed. Beyond the particular regular expressions involved, the script relies heavily on the natural anchors in the text that enclose the values: where they begin (just after the title of the table they belong to) and where they end (just before the total amount).

library(tidyverse)
library(stringr)

deposits_and_additions <- raw_text_combined %>%
  # Replace \n with ';' so that '.' can match across lines
  str_replace_all('\\n', ';') %>%
  str_extract("(?<=DEPOSITS AND ADDITIONS).*(?=Total Deposits and Additions)") %>%
  str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>%
  str_split(';') %>%
  as_tibble(.name_repair = "unique") %>%
  transmute(date = ...1 %>% str_extract('\\d{2}\\/\\d{2}'),
            transaction = ...1 %>% str_extract('(?<=\\d{2} ).*(?= (\\$|\\d))'),
            amount = ...1 %>% str_extract("(?<=[A-Za-z] ).*") %>%
              str_extract('\\d{1,}.*'))

checks_paid <- raw_text_combined %>%
  # Replace \n with ';' so that '.' can match across lines
  str_replace_all('\\n', ';') %>%
  str_extract('(?<=CHECKS PAID).*(?=Total Checks Paid)') %>%
  # Would have to change the regex to get check numbers and descriptions,
  # but it's not relevant here
  str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>%
  str_split(';') %>%
  as_tibble(.name_repair = "unique") %>%
  transmute(date = ...1 %>% str_extract('\\d{2}\\/\\d{2}'),
            amount = ...1 %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>%
              str_extract('\\d{1,}.*'))

others <- raw_text_combined %>%
  # Replace \n with ';' so that '.' can match across lines
  str_replace_all('\\n', ';') %>%
  str_extract('(?<=OTHER WITHDRAWALS, FEES & CHARGES).*(?=Total Other Withdrawals, Fees & Charges)') %>%
  str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>%
  str_split(';') %>%
  as_tibble(.name_repair = "unique") %>%
  transmute(date = ...1 %>% str_extract('\\d{2}\\/\\d{2}'),
            description = ...1 %>% str_extract('(?<=\\d{2} ).*(?= (\\$|\\d))'),
            amount = ...1 %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>%
              str_extract('\\d{1,}.*'))

daily_ending_balances <- raw_text_combined %>%
  # Replace \n with ';' so that '.' can match across lines
  str_replace_all('\\n', ';') %>%
  str_extract('(?<=DAILY ENDING BALANCE).*(?=SERVICE CHARGE SUMMARY)') %>%
  str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>%
  str_split(';') %>%
  # Some lines hold two date/amount pairs, so split again before each date
  lapply(function(x) { x %>% str_split('(?= \\d{2}\\/\\d{2} )') }) %>%
  unlist() %>%
  as_tibble(.name_repair = "unique") %>%
  transmute(date = value %>% str_extract('\\d{2}\\/\\d{2}'),
            amount = value %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>%
              str_extract('\\d{1,}.*'))
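To make the anchoring idea in the script above easier to follow, here is a minimal sketch on a toy string. The text is invented for illustration and is not from the sample statement. The final amount pattern is my own simplification rather than the one used in the script, and it assumes amounts always end the row with two decimals.

```r
library(stringr)

# Invented OCR-like text with a title anchor and a total anchor.
toy_text <- paste0(
  "DEPOSITS AND ADDITIONS\n",
  "DATE DESCRIPTION AMOUNT\n",
  "01/05 Deposit $1,200.00\n",
  "01/19 Deposit 350.25\n",
  "Total Deposits and Additions $1,550.25\n"
)

# 1. Replace newlines so that '.' can match across what used to be lines.
flat <- str_replace_all(toy_text, "\\n", ";")

# 2. Keep only what sits between the title and the total line.
section <- str_extract(
  flat,
  "(?<=DEPOSITS AND ADDITIONS).*(?=Total Deposits and Additions)"
)

# 3. Trim to the span from the first date to the last amount.
rows_block <- str_extract(section, "\\d{2}/\\d{2}.*\\.\\d{2}")

# 4. Split back into one string per table row.
rows <- str_split(rows_block, ";")[[1]]

# 5. Pull the fields out of each row.
dates <- str_extract(rows, "\\d{2}/\\d{2}")
amounts <- str_extract(rows, "\\d{1,3}(,\\d{3})*\\.\\d{2}$")

data.frame(date = dates, amount = amounts)
```

Because the title is in upper case and the total line is in title case, the two anchors cannot be confused with each other even though they share the same words.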

The cleaned data extracted from the bank statement is:

statement_data <- list("DEPOSITS AND ADDITIONS" = deposits_and_additions,
                       "CHECKS PAID" = checks_paid,
                       "OTHER WITHDRAWALS, FEES & CHARGES" = others,
                       "DAILY ENDING BALANCE" = daily_ending_balances)

statement_data

## $`DEPOSITS AND ADDITIONS`
## # A tibble: 10 x 3
##    date  transaction amount
##    <chr> <chr>       <chr>
##  1 07/02 Deposit     17,120.00
##  2 07/09 Deposit     24,610.00
##  3 07/14 Deposit     11,424.00
##  4 07/15 Deposit     1,349.00
##  5 07/21 Deposit     5,000.00
##  6 07/21 Deposit     3,120.00
##  7 07/23 Deposit     33,138.00
##  8 07/28 Deposit     18,114.00
##  9 07/30 Deposit     6,908.63
## 10 07/30 Deposit     5,100.00
##
## $`CHECKS PAID`
## # A tibble: 2 x 2
##   date  amount
##   <chr> <chr>
## 1 07/14 1,471.99
## 2 07/08 1,697.05
##
## $`OTHER WITHDRAWALS, FEES & CHARGES`
## # A tibble: 4 x 3
##   date  description                    amount
##   <chr> <chr>                          <chr>
## 1 07/11 Online Payment XXXXX To Vendor 8,928.00
## 2 07/11 Online Payment XXXXX To Vendor 2,960.00
## 3 07/25 Online Payment XXXXX To Vendor 250.00
## 4 07/30 ADP TX/Fincl Svc ADP           2,887.68
##
## $`DAILY ENDING BALANCE`
## # A tibble: 11 x 2
##    date  amount
##    <chr> <chr>
##  1 07/02 98,727.40
##  2 07/21 129,173.36
##  3 07/08 97,030.35
##  4 07/23 162,311.36
##  5 07/09 121,640.35
##  6 07/25 162,061.36
##  7 07/11 109,752.35
##  8 07/28 180,175.36
##  9 07/14 108,280.36
## 10 07/30 189,296.31
## 11 07/16 121,053.36
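The amounts printed above are still character strings, and the daily ending balances come out interleaved rather than in date order (presumably because the table is laid out in two columns on the page). As a possible follow-up step, here is a minimal base R sketch that converts the amounts to numbers and sorts by date. The helper parse_amount is a hypothetical name of my own, and the data frame holds just the first four balance rows copied by hand.

```r
# First four rows of the DAILY ENDING BALANCE output, copied by hand.
balances <- data.frame(
  date = c("07/02", "07/21", "07/08", "07/23"),
  amount = c("98,727.40", "129,173.36", "97,030.35", "162,311.36"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop "$" and thousands separators, then convert.
parse_amount <- function(x) as.numeric(gsub("[$,]", "", x))

balances$amount <- parse_amount(balances$amount)

# Zero-padded MM/DD strings sort chronologically within a single year.
balances <- balances[order(balances$date), ]
balances
```

Sorting on the date string only works because the statement covers a single year and the dates are zero-padded; a statement spanning a year boundary would need a real date column.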

There you have it! The tesseract package has opened up a world of data processing tools that I now have at my disposal, and I hope I was able to show that in this blog. While this blog focused only on JPMorgan Chase bank statements, it's possible to apply the same techniques to other bank statements by tweaking the cleaning script accordingly.

Thank you for reading!

