#RObservations #24: Using Tesseract-OCR to Scan Bank Documents and Extract Relevant Data (2024)


I have clearly been out of the loop because I have only recently learned about the tesseract library in R. If I had known about it earlier, I would have written about it much sooner!

The tesseract library is a package with bindings to the Tesseract-OCR engine: a powerful optical character recognition (OCR) engine that supports over 100 languages and lets users scan and extract text from pictures. This has direct applications in any field that deals with large amounts of manual data processing, such as accounting, mortgages/real estate, insurance and archival work. In this blog I am going to explore how it's possible to parse a JPMorgan Chase bank statement using the tesseract, pdftools, stringr, and tidyverse libraries.

Disclaimer: The bank statement I am using is from a sample template and is not personal information. The statement sample can be accessed here; the document being parsed is shown below.

[Images: the pages of the sample bank statement being parsed]

From what I’ve seen in the CRAN documentation we need to first convert the .pdf file into .png files. This can be done with the pdf_convert() function available in the pdftools package. To ensure that text will be read accurately, setting the dpi argument to a large number is recommended.

bank_statement <- pdftools::pdf_convert("sample-bank-statement.pdf", dpi = 1000)
## Converting page 1 to sample-bank-statement_1.png... done!
## Converting page 2 to sample-bank-statement_2.png... done!
## Converting page 3 to sample-bank-statement_3.png... done!
## Converting page 4 to sample-bank-statement_4.png... done!

Now that the bank statement has been converted to .png files, it can be read with tesseract. At this point the text is very unstructured and needs to be parsed. I'll spare you the output, but you can see it for yourself by printing the raw_text vector on your machine.
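The call that produces raw_text is not shown in this post. A minimal sketch of that step, assuming the ocr() function from the tesseract package with its default English engine and the .png file names printed by pdf_convert(), could look like this. The ocr_pages() helper and its ocr_fun argument are hypothetical names introduced only for illustration.

```r
# Hypothetical helper: run OCR on each page image and return one string per page.
# ocr_fun defaults to tesseract::ocr(); it is an argument only so that the
# helper can be tried out without the OCR engine installed.
ocr_pages <- function(paths, ocr_fun = tesseract::ocr) {
  vapply(paths, function(p) ocr_fun(p), character(1), USE.NAMES = FALSE)
}

# File names as printed by pdf_convert() for the four pages
png_files <- sprintf("sample-bank-statement_%d.png", 1:4)

# Only run the OCR when the images and the tesseract package are available
if (all(file.exists(png_files)) && requireNamespace("tesseract", quietly = TRUE)) {
  raw_text <- ocr_pages(png_files)
}
```

The result is a character vector with one element per page, which is the shape the rest of the post expects from raw_text.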

From a bank statement, businesses are interested in the data from the fields listed in the document, namely the:

  • Deposits and additions,
  • Checks paid,
  • Other withdrawals, fees and charges; and
  • Daily ending balances.

While this bank statement is relatively small at only 4 pages, a general method for extracting data from JPMorgan Chase bank statements has to handle larger ones too, so the scanned text from every page needs to be combined into a single string and then parsed accordingly.

# Bind raw text together
raw_text_combined <- paste(raw_text, collapse = "")

To get the data for the desired fields, we can use the field titles as anchors for parsing the data. With this approach, and with the help of regex101.com, this cleaning script was constructed. Beyond the particular regular expressions involved, the cleaning script relies heavily on the natural anchors in the text that enclose the values: where they begin (around the title of the table they belong to) and where they end (just before the total amount).
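To make the anchoring idea concrete, here is a small sketch on a made-up string laid out like OCR output. The toy_text value is an assumption for illustration and is not the real scanned text. It uses a lookbehind for the table title and a lookahead for the total line, the same pattern the cleaning script relies on.

```r
library(stringr)

# Made-up text laid out like OCR output; not taken from the real scan
toy_text <- paste0(
  "DEPOSITS AND ADDITIONS\n",
  "DATE DESCRIPTION AMOUNT\n",
  "07/02 Deposit 17,120.00\n",
  "07/09 Deposit 24,610.00\n",
  "Total Deposits and Additions $41,730.00\n"
)

rows <- toy_text %>%
  # Put everything on one line so "." can span the whole table
  str_replace_all("\\n", ";") %>%
  # Keep only what sits between the title and the total line
  str_extract("(?<=DEPOSITS AND ADDITIONS).*(?=Total Deposits and Additions)") %>%
  # Trim to the span from the first date to the last amount
  str_extract("\\d{2}/\\d{2}.*\\.\\d{2}") %>%
  # One element per table row
  str_split(";") %>%
  unlist()

rows
```

Each element of rows is one line of the table, ready for the date, description and amount to be pulled out with further str_extract() calls.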

library(tidyverse)
library(stringr)

deposits_and_additions <- raw_text_combined %>%
  # Replace newlines with ";" so that "." can match across the whole table
  str_replace_all('\\n', ';') %>%
  # Keep the text between the table title and its total line
  str_extract("(?<=DEPOSITS AND ADDITIONS).*(?=Total Deposits and Additions)") %>%
  # Trim to the span from the first date to the last amount
  str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>%
  str_split(';') %>%
  as_tibble(.name_repair = "unique") %>%
  transmute(date = ...1 %>% str_extract('\\d{2}\\/\\d{2}'),
            transaction = ...1 %>% str_extract('(?<=\\d{2} ).*(?= (\\$|\\d))'),
            amount = ...1 %>% str_extract("(?<=[A-Za-z] ).*") %>%
              str_extract('\\d{1,}.*'))

checks_paid <- raw_text_combined %>%
  str_replace_all('\\n', ';') %>%
  str_extract('(?<=CHECKS PAID).*(?=Total Checks Paid)') %>%
  # The regex would need changing to also capture check numbers and
  # descriptions, but that is not relevant here
  str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>%
  str_split(';') %>%
  as_tibble(.name_repair = "unique") %>%
  transmute(date = ...1 %>% str_extract('\\d{2}\\/\\d{2}'),
            amount = ...1 %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>%
              str_extract('\\d{1,}.*'))

others <- raw_text_combined %>%
  str_replace_all('\\n', ';') %>%
  str_extract('(?<=OTHER WITHDRAWALS, FEES & CHARGES).*(?=Total Other Withdrawals, Fees & Charges)') %>%
  str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>%
  str_split(';') %>%
  as_tibble(.name_repair = "unique") %>%
  transmute(date = ...1 %>% str_extract('\\d{2}\\/\\d{2}'),
            description = ...1 %>% str_extract('(?<=\\d{2} ).*(?= (\\$|\\d))'),
            amount = ...1 %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>%
              str_extract('\\d{1,}.*'))

daily_ending_balances <- raw_text_combined %>%
  str_replace_all('\\n', ';') %>%
  str_extract('(?<=DAILY ENDING BALANCE).*(?=SERVICE CHARGE SUMMARY)') %>%
  str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>%
  str_split(';') %>%
  # A scanned line can hold more than one date/amount pair, so split again
  # just before each additional date
  lapply(function(x) {
    x %>% str_split('(?= \\d{2}\\/\\d{2} )')
  }) %>%
  unlist() %>%
  as_tibble(.name_repair = "unique") %>%
  transmute(date = value %>% str_extract('\\d{2}\\/\\d{2}'),
            amount = value %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>%
              str_extract('\\d{1,}.*'))

The cleaned data extracted from the bank statement is:

statement_data <- list("DEPOSITS AND ADDITIONS" = deposits_and_additions,
                       "CHECKS PAID" = checks_paid,
                       "OTHER WITHDRAWALS, FEES & CHARGES" = others,
                       "DAILY ENDING BALANCE" = daily_ending_balances)
statement_data
## $`DEPOSITS AND ADDITIONS`
## # A tibble: 10 x 3
##    date  transaction amount
##    <chr> <chr>       <chr>
##  1 07/02 Deposit     17,120.00
##  2 07/09 Deposit     24,610.00
##  3 07/14 Deposit     11,424.00
##  4 07/15 Deposit     1,349.00
##  5 07/21 Deposit     5,000.00
##  6 07/21 Deposit     3,120.00
##  7 07/23 Deposit     33,138.00
##  8 07/28 Deposit     18,114.00
##  9 07/30 Deposit     6,908.63
## 10 07/30 Deposit     5,100.00
##
## $`CHECKS PAID`
## # A tibble: 2 x 2
##   date  amount
##   <chr> <chr>
## 1 07/14 1,471.99
## 2 07/08 1,697.05
##
## $`OTHER WITHDRAWALS, FEES & CHARGES`
## # A tibble: 4 x 3
##   date  description                    amount
##   <chr> <chr>                          <chr>
## 1 07/11 Online Payment XXXXX To Vendor 8,928.00
## 2 07/11 Online Payment XXXXX To Vendor 2,960.00
## 3 07/25 Online Payment XXXXX To Vendor 250.00
## 4 07/30 ADP TX/Fincl Svc ADP           2,887.68
##
## $`DAILY ENDING BALANCE`
## # A tibble: 11 x 2
##    date  amount
##    <chr> <chr>
##  1 07/02 98,727.40
##  2 07/21 129,173.36
##  3 07/08 97,030.35
##  4 07/23 162,311.36
##  5 07/09 121,640.35
##  6 07/25 162,061.36
##  7 07/11 109,752.35
##  8 07/28 180,175.36
##  9 07/14 108,280.36
## 10 07/30 189,296.31
## 11 07/16 121,053.36
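The amounts come back as text, and the daily ending balances are interleaved rather than in date order, presumably because the table is printed in two columns on the statement. As a rough sketch of a follow-up step, assuming dplyr is available and using a hypothetical amount_to_number() helper, the strings can be turned into numbers and the rows sorted by date. The three rows below are copied from the output above.

```r
library(dplyr)

# Hypothetical helper: drop "$" and thousands separators, then convert
amount_to_number <- function(x) as.numeric(gsub("[$,]", "", x))

# First three rows of the daily ending balances shown in the output
balances <- tibble(
  date   = c("07/02", "07/21", "07/08"),
  amount = c("98,727.40", "129,173.36", "97,030.35")
)

balances_clean <- balances %>%
  mutate(amount = amount_to_number(amount)) %>%
  # Zero-padded MM/DD strings sort correctly as text within a single year
  arrange(date)

balances_clean
```

Sorting on the text date only works here because every row falls in the same year; a statement that spans a year boundary would need proper dates.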

There you have it! The tesseract package has opened up a world of data processing tools that I now have at my disposal, and I hope I was able to show that in this blog. While this blog only focused on JPMorgan Chase bank statements, it's possible to apply the same techniques to other bank statements by tweaking the cleaning script accordingly.

Thank you for reading!



FAQs

What is Tesseract OCR used for?

Tesseract is an open source optical character recognition (OCR) platform. OCR extracts text from images and documents without a text layer and outputs the document into a new searchable text file, PDF, or most other popular formats.

How to use Google Tesseract OCR?

Steps in the Tesseract OCR process
  1. Call the engine – Tesseract is run from its command line program or through a binding such as the tesseract package for R.
  2. Input image – Pass in the image you want text extracted from.
  3. Image pre-processing – Before data extraction, the image pre-processing features of the Tesseract OCR engine kick in.

What is OCR data extraction?

Optical Character Recognition, or OCR as it is commonly known, is a type of software that converts scanned images into structured data that is extractable, editable and searchable. When your records are digitized, scanning is only the first step.

What is the difference between Tesseract and Tesseract OCR?

In summary, Tesseract OCR provides high accuracy, extensive language support, and flexibility through its standalone software. On the other hand, Tesseract.js offers in-browser OCR with easy integration, making it suitable for real-time text extraction and web-based applications.

Why do people use OCR software?

Text in images cannot be processed by word processing software in the same way as text documents. OCR technology solves the problem by converting text images into text data that can be analyzed by other business software.

What is an example of OCR?

In short, optical character recognition software helps convert images or physical documents into a searchable form. Examples of OCR are text extraction tools, PDF to .txt converters, and Google's image search function.

How do I use OCR on my phone?

Download an OCR app: There are several OCR apps available for both iOS and Android devices, such as Adobe Scan, CamScanner, and ABBYY FineScanner. Take a photo of the text you want to scan: Launch the OCR app and take a photo of the text you want to scan. The app will automatically detect the text and process it.

Is Google Tesseract free?

Tesseract is a free and open-source command line OCR engine that was developed at Hewlett-Packard in the mid-80s, and has been maintained by Google since 2006.

Does Tesseract need Internet?

Since it is open source and runs locally, the only cost of using Tesseract is the resources the machine uses, and there is no need to send the document and the results over the internet.

What is the full form of OCR in banking?

OCR, or optical character recognition, is the process of using technology to read text that has been handwritten or printed and turn it into a digital representation of the original document, such as a scanned paper document.

Where is OCR data stored?

That depends on the tool. With the hosted OCR service this answer describes, the extracted text is stored temporarily in Azure Storage so that the completion status can be checked and the results returned to the customer. When Tesseract is run locally, the extracted text stays on your own machine.

What does OCR mean on Google Drive?

Optical character recognition (OCR) is a technology that extracts text from images. It scans GIF, JPG, PNG, and TIFF images. If you turn it on, the extracted text is then subject to any content compliance or objectionable content rules you set up for Gmail messages.

Does Tesseract OCR require Internet?

Tesseract OCR is an offline tool that can be run with a number of options, of which the page segmentation mode tends to make the most difference.

What is the best image for Tesseract?

Tesseract works best on images with a resolution of at least 300 dots per inch (DPI), so it may help to rescale images before passing them to Tesseract.

How do I know if Tesseract OCR is installed?

To test that Tesseract OCR for Windows was installed successfully, open command prompt on your machine, then run the Tesseract command. You should see an output with a quick explanation of Tesseract's usage options. Congratulations! You've successfully installed Tesseract OCR for Windows on your machine.

How accurate is Tesseract OCR?

The accuracy of Tesseract depends on the size of the font as well as on the quality of the image. In the test this answer draws on, font sizes 8 to 10 did the worst, and all of the “average” quality images (75 dpi or pixels) performed badly no matter what font or font size was used.

Can Tesseract recognize handwriting?

Tesseract OCR doesn't work well on handwritten text. Passing a handwritten segment to Tesseract gives very poor reading results.

Is Google OCR better than Tesseract?

Character detection accuracy: In comparison to Google Vision, Tesseract does not perform as well with complex characters (for example, historical characters and ligatures).

Which OCR is better than Tesseract?

docTR performs better than Tesseract on many document types that Tesseract struggles with: scanned documents, screenshots, documents with strange fonts, etc. With some models, docTR's recall and precision are much better than those of Tesseract and even of some proprietary cloud-based services, as demonstrated in their benchmark table.

Article information

Author: Msgr. Refugio Daniel
