
PDF data extraction using Windows PowerShell

PDF data extraction using Windows PowerShell and the pdftotext command-line utility provided by the Xpdf tools.


In this tutorial we will extract some data from a batch of invoices and save it in a text file.

The approach is:

1) Get a list of invoices as PDF files.
2) Convert the first PDF to a plain text file using the pdftotext utility.
3) Find the required information and clean it.
4) Save the data to a flat file.
5) Repeat steps 2 to 4 for the remaining files.


Let's start

Open the Run dialog box by pressing Windows Key + R.


Open PowerShell ISE by typing powershell_ise.exe in the Run dialog.




Tip: PowerShell Integrated Scripting Environment (ISE) is an IDE for developing and testing PowerShell scripts. You can also use any other text editor (Notepad, Notepad++) to write PowerShell scripts.




Now let's dive into the script.

A) Set the variables

#### Variable setting start #####

# Set the pdftotext.exe location.
$pdfToTextPath='D:\Blog\PowerShell PDF Data Extract'

# Set the invoice pdf location.
$invoicePath='D:\Blog\PowerShell PDF Data Extract\SampleInvoicePDF'

#### Variable setting end #####

B) Do some verifications


# Check if the file pdftotext.exe exists at the provided location.
If (Test-Path "$pdfToTextPath\pdftotext.exe") {
    $pttExe = "$pdfToTextPath\pdftotext.exe"
}
else {
    Write-Host 'Check if the file pdftotext.exe exists at' $pdfToTextPath
    Exit
}

# Get the list of invoices and check for at least one invoice.
$invoiceList = Get-ChildItem -Path $invoicePath -Filter *.pdf
if ($invoiceList.Count -lt 1) {
    Write-Host 'Check if the invoice pdf exists at' $invoicePath
    Exit
}
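One detail worth hedging in the invoice count check: on older Windows PowerShell versions (2.0), a single returned file is a plain object without a Count property, so the comparison may not behave as expected. A minimal sketch of a more defensive count, wrapping the result in @() so it is always an array, is shown below. The temporary folder and the file name one.pdf are placeholders made up for this illustration and are not part of the original script.

```powershell
# Hypothetical scratch folder, used only for this illustration.
$tempDir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $tempDir | Out-Null

# No PDF files yet: @() turns "nothing" into an empty array, so Count is 0.
$emptyCount = @(Get-ChildItem -Path $tempDir -Filter *.pdf).Count

# Exactly one PDF file: @() wraps the single item, so Count is 1.
Set-Content -Path (Join-Path $tempDir 'one.pdf') -Value 'placeholder'
$oneCount = @(Get-ChildItem -Path $tempDir -Filter *.pdf).Count

# Clean up the scratch folder.
Remove-Item -Path $tempDir -Recurse -Force

"Empty folder: $emptyCount, one file: $oneCount"
```

In the script itself, the same idea would amount to writing $invoiceList = @(Get-ChildItem -Path $invoicePath -Filter *.pdf).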

C) Create the output file with data header.

# Create the output file.

$outFileName = ("{0}{1}" -f (Get-Date -Format 'ddMMMyyyy_HHmmss'),'.txt')
$header = 'CustomerName|InvoiceNo|AmountDue'
$header | Set-Content $pdfToTextPath\$outFileName
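To see what the ddMMMyyyy_HHmmss format produces, here is a small sketch that applies the same format string to a fixed date instead of the current time. The date is arbitrary and chosen only for illustration. The invariant culture is used so that the month abbreviation is predictable; Get-Date -Format uses the current culture, so the month text may differ on a non-English system.

```powershell
# Arbitrary fixed date, used only so the result is reproducible.
$fixedDate = [datetime]'2024-03-05 14:07:09'

# Same format string as in the script, formatted with the invariant culture.
$stamp = $fixedDate.ToString('ddMMMyyyy_HHmmss', [System.Globalization.CultureInfo]::InvariantCulture)

# Build the file name the same way the script does.
$outFileName = "{0}{1}" -f $stamp, '.txt'
$outFileName   # 05Mar2024_140709.txt
```

Because the name includes the time down to the second, each run of the script writes to a new output file rather than overwriting the previous one.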

D) Set up the loop to process each pdf file.

# Set up the processing loop

foreach ($invoice in $invoiceList) {

    # Convert the PDF invoice to txt file
    & $pttExe -table $invoice.FullName "$invoicePath\processed.txt"

    foreach ($line in Get-Content "$invoicePath\processed.txt") {

        # Get the Customer Name and Invoice Number
        if ($line -match 'Invoice For') {
            #Customer Name
            $custNameStartIndex = 12 # length of string 'Invoice For' + 1. Can be counted by 'Invoice For'.Length
            $invoiceNoStartIndex = ($line.IndexOf('Invoice ID'))
            $custName = $line.Substring($custNameStartIndex,$invoiceNoStartIndex - $custNameStartIndex)
            $custName = $custName.Replace('_','').Trim()

            #Invoice Number
            $invoiceNo = $line.Substring($invoiceNoStartIndex + 'Invoice ID'.Length)
            $invoiceNo = $invoiceNo.Replace('_','').Trim()
        }

        # Get the Amount Due
        if ($line -match "Amount Due") {
            $amountStartIndex = $line.IndexOf("Amount Due") + "Amount Due".Length
            $amountDue = $line.Substring($amountStartIndex)
            $amountDue = $amountDue.Replace("_","").Trim()
        }
    }

    ("{0}|{1}|{2}" -f $custName,$invoiceNo,$amountDue) | Add-Content $pdfToTextPath\$outFileName
}
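The string handling inside the loop can be tried on its own, without any PDF. The sketch below applies the same Substring, IndexOf, Replace and Trim steps to two made-up lines. The sample text (Acme Corp, INV-0042, the amount) and the function name Get-InvoiceHeaderField are assumptions for illustration only; the real layout depends on what pdftotext produces for your invoices, so inspect processed.txt before relying on the offsets.

```powershell
# Hypothetical helper: parses a header line of the form
# 'Invoice For <name>___ Invoice ID ___<number>' using the script's logic.
function Get-InvoiceHeaderField {
    param([string]$Line)

    # Name starts right after 'Invoice For' plus one space (11 + 1 = 12).
    $custNameStartIndex  = 'Invoice For'.Length + 1
    $invoiceNoStartIndex = $Line.IndexOf('Invoice ID')

    # Guard: if 'Invoice ID' is missing or too early, Substring would throw.
    if ($invoiceNoStartIndex -lt $custNameStartIndex) { return $null }

    $custName  = $Line.Substring($custNameStartIndex, $invoiceNoStartIndex - $custNameStartIndex)
    $invoiceNo = $Line.Substring($invoiceNoStartIndex + 'Invoice ID'.Length)

    # Strip the underscore filler and surrounding whitespace.
    [pscustomobject]@{
        CustomerName = $custName.Replace('_', '').Trim()
        InvoiceNo    = $invoiceNo.Replace('_', '').Trim()
    }
}

# Made-up sample lines; the real text comes from processed.txt.
$headerLine = 'Invoice For Acme Corp_______   Invoice ID ___INV-0042'
$amountLine = 'Amount Due ______ $1,250.00'

$fields = Get-InvoiceHeaderField -Line $headerLine

# Same steps as the 'Amount Due' branch of the loop.
$amountStartIndex = $amountLine.IndexOf('Amount Due') + 'Amount Due'.Length
$amountDue = $amountLine.Substring($amountStartIndex).Replace('_', '').Trim()

"{0}|{1}|{2}" -f $fields.CustomerName, $fields.InvoiceNo, $amountDue
```

Note that the loop in the script does not reset $custName, $invoiceNo and $amountDue between invoices, so if a PDF lacks one of the labels, the value from the previous invoice is written again. Clearing the three variables at the top of each iteration would avoid that.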

That's it. After the script runs, the output file is generated in the folder set in $pdfToTextPath.
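Since the output is a pipe-delimited file with a header row, it can be read back as objects with Import-Csv. The sketch below writes a tiny file in the same shape to a temporary location and reads it again. The data row is made up for illustration; in practice you would point Import-Csv at the generated output file.

```powershell
# Temporary file standing in for the generated output file.
$tempFile = [System.IO.Path]::GetTempFileName()

# Header row as in the script, followed by one made-up data row.
'CustomerName|InvoiceNo|AmountDue' | Set-Content $tempFile
'Acme Corp|INV-0042|$1,250.00'     | Add-Content $tempFile

# Read the file back; each row becomes an object with the header names as properties.
$rows = @(Import-Csv -Path $tempFile -Delimiter '|')
Remove-Item $tempFile

$rows | Format-Table -AutoSize
```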

You can download the sample invoices here and the whole script file here. The pdftotext utility can be downloaded from https://www.xpdfreader.com/download.html or here.

