PDF data extraction using Windows PowerShell and the pdftotext command-line utility from the Xpdf command line tools.
In this tutorial we will extract some data from a batch of invoices and save it in a text file.
The approach is:
1) Get a list of invoices as PDF files.
2) Convert the first PDF to a plain text file using the pdftotext utility.
3) Find the required information and clean it.
4) Save the data to a flat file.
5) Repeat steps 2 to 4 for the remaining files.
Let's start.
Open the Run dialog box by pressing Windows Key + R.
Open PowerShell ISE by typing powershell_ise.exe in the Run dialog.
Tip :- PowerShell Integrated Scripting Environment (ISE) is an IDE for developing and testing PowerShell scripts. You can also use any other text editor (Notepad, Notepad++) to write PowerShell scripts.
Now let's dive into the script.
A) Set the variables
#### Variable setting start #####
# Set the pdftotext.exe location.
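The rest of the script relies on two variables, $pdfToTextPath and $invoicePath. A minimal sketch of how they could be set is shown here; both folder paths are hypothetical placeholders, not values from the original script, so replace them with the real locations on your machine.

```powershell
#### Variable setting start #####
# Folder that contains pdftotext.exe (hypothetical path, adjust as needed).
$pdfToTextPath = 'C:\Tools\xpdf-tools\bin64'

# Folder that contains the invoice PDF files (hypothetical path, adjust as needed).
$invoicePath = 'C:\Invoices'
#### Variable setting end #####
```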
B) Do some verifications
# Check if the file pdftotext.exe exists at the provided location.
if (Test-Path "$pdfToTextPath\pdftotext.exe") {
    $pttExe = "$pdfToTextPath\pdftotext.exe"
}
else {
    Write-Host 'Check if the file pdftotext.exe exists at' $pdfToTextPath
    Exit
}

# Get the list of invoices and check for at least one invoice.
# @( ) keeps the result an array, so .Count is reliable even for a single file.
$invoiceList = @(Get-ChildItem -Path $invoicePath -Filter *.pdf)
if ($invoiceList.Count -lt 1) {
    Write-Host 'Check if the invoice pdf exists at' $invoicePath
    Exit
}
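To see how the invoice count check behaves, the following sketch runs it against a throwaway folder, first empty and then with one file in it. The temporary folder and the file name sample.pdf are made up for illustration. Wrapping Get-ChildItem in @( ) is one way to make sure the result is always an array, so .Count gives a number whether zero, one, or many files are found.

```powershell
# Create a temporary, empty folder (hypothetical name) to stand in for the invoice folder.
$tempDir = Join-Path ([System.IO.Path]::GetTempPath()) ("invoices_" + [guid]::NewGuid().ToString('N'))
New-Item -ItemType Directory -Path $tempDir | Out-Null

# No PDF files yet: the count is 0, so the script would stop with a message.
$emptyCount = @(Get-ChildItem -Path $tempDir -Filter *.pdf).Count

# Add one empty placeholder file with a .pdf extension and count again.
New-Item -ItemType File -Path (Join-Path $tempDir 'sample.pdf') | Out-Null
$oneCount = @(Get-ChildItem -Path $tempDir -Filter *.pdf).Count

Write-Host "Empty folder: $emptyCount file(s); after adding one: $oneCount file(s)"

# Clean up the temporary folder.
Remove-Item -Path $tempDir -Recurse -Force
```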
C) Create the output file with data header.
# Create the output file.
$outFileName = ("{0}{1}" -f (Get-Date -Format 'ddMMMyyyy_HHmmss'),'.txt')
$header = 'CustomerName|InvoiceNo|AmountDue'
$header | Set-Content $pdfToTextPath\$outFileName
D) Set up the loop to process each pdf file.
# Set up the processing loop
foreach ($invoice in $invoiceList) {
    # Convert the PDF invoice to txt file
    & $pttExe -table $invoice.FullName "$invoicePath\processed.txt"

    foreach ($line in Get-Content "$invoicePath\processed.txt") {
        # Get the Customer Name and Invoice Number
        if ($line -match 'Invoice For') {
            # Customer Name
            $custNameStartIndex = 12 # length of string 'Invoice For' + 1. Can be counted by 'Invoice For'.Length
            $invoiceNoStartIndex = ($line.IndexOf('Invoice ID'))
            $custName = $line.Substring($custNameStartIndex,$invoiceNoStartIndex - $custNameStartIndex)
            $custName = $custName.Replace('_','').Trim()
            # Invoice Number
            $invoiceNo = $line.Substring($invoiceNoStartIndex + 'Invoice ID'.Length)
            $invoiceNo = $invoiceNo.Replace('_','').Trim()
        }
        # Get the Amount Due
        if ($line -match "Amount Due") {
            $amountStartIndex = $line.IndexOf("Amount Due") + "Amount Due".Length
            $amountDue = $line.Substring($amountStartIndex)
            $amountDue = $amountDue.Replace("_","").Trim()
        }
    }
    ("{0}|{1}|{2}" -f $custName,$invoiceNo,$amountDue) | Add-Content $pdfToTextPath\$outFileName
}
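To make the string handling easier to follow, here is the same Substring/IndexOf logic applied to a few in-memory lines instead of a converted PDF. The sample lines (customer "Acme Corp", invoice "INV-1001", amount "$1,250.00") are invented for illustration and only imitate the kind of layout the script expects; real pdftotext output may differ.

```powershell
# Hypothetical lines, imitating what the converted text of one invoice might contain.
$sampleLines = @(
    'Invoice For Acme Corp________ Invoice ID ____INV-1001',
    'Some other line',
    'Amount Due ________ $1,250.00'
)

$custName = $invoiceNo = $amountDue = ''

foreach ($line in $sampleLines) {
    if ($line -match 'Invoice For') {
        # The customer name sits between 'Invoice For ' and 'Invoice ID'.
        $custNameStartIndex  = 'Invoice For'.Length + 1
        $invoiceNoStartIndex = $line.IndexOf('Invoice ID')
        $custName  = $line.Substring($custNameStartIndex, $invoiceNoStartIndex - $custNameStartIndex).Replace('_', '').Trim()
        # The invoice number is everything after 'Invoice ID'.
        $invoiceNo = $line.Substring($invoiceNoStartIndex + 'Invoice ID'.Length).Replace('_', '').Trim()
    }
    if ($line -match 'Amount Due') {
        # The amount is everything after 'Amount Due'.
        $amountStartIndex = $line.IndexOf('Amount Due') + 'Amount Due'.Length
        $amountDue = $line.Substring($amountStartIndex).Replace('_', '').Trim()
    }
}

# Build the pipe-delimited record in the same format as the output file.
$record = '{0}|{1}|{2}' -f $custName, $invoiceNo, $amountDue
$record   # Acme Corp|INV-1001|$1,250.00
```

The underscores and surrounding spaces are stripped by Replace('_','') and Trim(), which is why the filler characters around each value do not end up in the record.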
That's it. Once the script finishes, your output file will be in the pdftotext folder ($pdfToTextPath), named with the date and time of the run.
You can download the sample invoices here and the whole script file here. The pdftotext utility can be downloaded from https://www.xpdfreader.com/download.html or here.
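Because the output is a pipe-delimited file with a header row, it can be read back into PowerShell objects with Import-Csv. The sketch below writes a small sample file to the temp folder and loads it; the file name and the single data row are made up for illustration.

```powershell
# Write a small sample file in the same format as the script's output (hypothetical data).
$sampleFile = Join-Path ([System.IO.Path]::GetTempPath()) 'sample_invoice_output.txt'
'CustomerName|InvoiceNo|AmountDue' | Set-Content $sampleFile
'Acme Corp|INV-1001|$1,250.00'     | Add-Content $sampleFile

# Read it back; each data row becomes an object with one property per header column.
$records = @(Import-Csv -Path $sampleFile -Delimiter '|')
$records | Format-Table -AutoSize

# Individual fields can then be accessed by name.
$firstCustomer = $records[0].CustomerName

# Clean up the sample file.
Remove-Item $sampleFile
```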


