By:
hamlinux

Does anyone have any AWK , SED, GREP, fu for parsing DO billing

February 23, 2017 393 views
Billing

Does anyone have any AWK, SED, or GREP fu for parsing DO billing info (the Billing history PDF) to more easily invoice clients?

1 comment
1 Answer

@hamlinux

If you're using Ubuntu, you can install poppler-utils and use pdftotext which will convert a PDF to a text file. You could then process the text file as you see fit.

To keep the layout the same as it's shown on the invoice, you'd use something such as:

pdftotext -layout my.pdf

The above will create a file called my.txt in the same directory, which is the text-based version of the PDF.
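If you'd rather skip the intermediate file, pdftotext sends the text to standard output when the output name is given as -. Here's a minimal sketch, assuming pdftotext from poppler-utils is on your PATH; the function name invoice_text is made up for this example:

```shell
#!/bin/sh
# invoice_text is a hypothetical helper name, not part of poppler-utils.
# It prints the layout-preserving text of a PDF on standard output.
invoice_text() {
    pdf=$1
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found; install poppler-utils first" >&2
        return 127
    fi
    if [ ! -r "$pdf" ]; then
        echo "cannot read: $pdf" >&2
        return 1
    fi
    # "-" as the output name sends the text to stdout instead of a .txt file
    pdftotext -layout "$pdf" -
}

# Usage: invoice_text my.pdf | grep 'droplet-name'
```

This lets you pipe straight into grep or awk without leaving my.txt behind.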

Once my.txt is created, you could use something such as:

grep 'droplet-name' my.txt

... and get output that looks something like this:

droplet-name (4GB)                        125     01-01 00:00   01-06 05:21   $7.44
droplet-name (4GB Backup Services)                01-01 00:00   01-06 05:21   $0.37
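One caveat: grep treats the pattern as a regular expression, so a dot in a Droplet name matches any character. A small sketch using -F (fixed-string matching) avoids that. The sample lines and the helper name match_droplet are made up for illustration:

```shell
#!/bin/sh
# Made-up sample lines; the second name differs only where the first has a dot.
lines='web.example.com (4GB)      125     01-01 00:00   01-06 05:21   $7.44
webXexample.com (1GB)       10     01-02 00:00   01-02 10:00   $0.15'

# match_droplet is a hypothetical helper: -F matches the name as a fixed
# string, and -- stops a name starting with "-" being read as an option.
match_droplet() {
    grep -F -- "$1"
}

# Prints only the web.example.com line; a plain grep would print both.
printf '%s\n' "$lines" | match_droplet 'web.example.com'
```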

If you need to get the hours a Droplet was in use, something such as the following would work (a very basic example without going into regex or similar):

grep 'droplet-name (4GB)' my.txt | awk '{print $3}'

You could then grab the total cost (i.e. $7.44) by running:

grep 'droplet-name (4GB)' my.txt | awk '{print $8}'

Essentially, we're just counting the whitespace-separated columns, which is where $3 and $8 come from. awk splits on whitespace, so (4GB) counts as column $2, the hours are $3, and each date and time is its own column.
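Counting columns from the left breaks on the Backup Services line: it has no hours column and its label contains extra spaces, so the cost isn't in $8 there. Here's a minimal sketch that counts from the right instead, using awk's NF (the number of fields on the line). The sample text just reproduces the two lines shown earlier, and droplet_total is a name made up for this example:

```shell
#!/bin/sh
# Sample data copied from the grep output earlier in this answer;
# in practice you would feed in my.txt instead.
sample='droplet-name (4GB)                        125     01-01 00:00   01-06 05:21   $7.44
droplet-name (4GB Backup Services)                01-01 00:00   01-06 05:21   $0.37'

# droplet_total is a hypothetical helper: reads billing lines on stdin and
# sums the last column of every line containing the given name.
droplet_total() {
    awk -v name="$1" '
        index($0, name) {            # plain substring match, no regex
            cost = $NF               # last field, wherever it sits
            sub(/^\$/, "", cost)     # "$7.44" -> "7.44"
            total += cost
        }
        END { printf "%.2f\n", total }
    '
}

printf '%s\n' "$sample" | droplet_total 'droplet-name'
# prints 7.81
```

With a real file, that would be: droplet_total 'droplet-name' < my.txt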

That being said, once the PDF is converted to text, you may be better off using your programming language of choice to process the data. Processing with bash can get a little complex and finicky, as you need to handle various scenarios depending on what you use.

For example, with PHP (chosen as that's what I'm working with right now), we could use something like the mini-script I have below. It's designed to run from the CLI using:

php name-of-script.php

... and accepts three arguments. The first is the path to the file that you've converted to text using the pdftotext command above, the second is the hostname or name of the Droplet, and the third is optional: it defaults to false, but if true is passed, you get a JSON-encoded string instead of an array of data.

Usage:

php name-of-script.php /path/to/my.txt droplet_name

or

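php name-of-script.php /path/to/my.txt droplet_name true

<?php
// Collects every line of the converted text file that mentions the Droplet
// and splits it into columns. Returns an array, or a JSON string if $json is true.
function getDropletBilling( $file, $droplet, $json = false )
{
    $client = [];

    if ( file_exists( $file ) )
    {
        $data = fopen( $file, 'r' );

        if ( $data )
        {
            while ( ( $line = fgets( $data ) ) !== false )
            {
                if ( strpos( $line, $droplet ) !== false )
                {
                    // Strip anything inside (), e.g. the Droplet size.
                    $dropletData = preg_replace( "/\([^)]+\)/", '', $line );
                    // Split on single spaces, then drop the empty strings left
                    // by runs of spaces. Note: array_filter() also drops a literal "0".
                    $dropletData = explode( ' ', $dropletData );
                    $dropletData = array_filter( $dropletData );
                    $dropletData = array_values( $dropletData );

                    $client[] = $dropletData;
                }
            }

            // Only close the handle if fopen() succeeded.
            fclose( $data );
        }
    }

    if ( false === $json )
    {
        return $client;
    }
    else
    {
        return json_encode( $client );
    }
}

if ( ! empty( $argv ) )
{
    // Drop the script name so only the user-supplied arguments remain.
    array_shift( $argv );

    $count = count( $argv );

    if ( $count < 2 || $count > 3 )
    {
        throw new InvalidArgumentException(
            'Function: getDropletBilling() expects two or three arguments, ' . $count . ' given.'
        );
    }

    // Treat the optional third argument as a boolean ("true", "1", "yes", "on").
    $json = $count === 3 && filter_var( $argv[2], FILTER_VALIDATE_BOOLEAN );

    $result = getDropletBilling( $argv[0], $argv[1], $json );

    // A top-level return prints nothing, so output the result explicitly.
    if ( $json )
    {
        echo $result, PHP_EOL;
    }
    else
    {
        var_dump( $result );
    }
}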

As an example, if you don't pass true, you'll get an array containing one array for each line of the text file in which droplet_name shows up. Because the script calls array_values(), the keys in each row are sequential. The last element of each row keeps the newline that fgets() reads from the file, hence string(6) here and the \n in the JSON further down. For example:

array(2) {
  [0]=>
  array(7) {
    [0]=>
    string(12) "droplet_name"
    [1]=>
    string(3) "125"
    [2]=>
    string(5) "01-01"
    [3]=>
    string(5) "00:00"
    [4]=>
    string(5) "01-06"
    [5]=>
    string(5) "05:21"
    [6]=>
    string(6) "$7.44
"
  }
  [1]=>
  array(6) {
    [0]=>
    string(12) "droplet_name"
    [1]=>
    string(5) "01-01"
    [2]=>
    string(5) "00:00"
    [3]=>
    string(5) "01-06"
    [4]=>
    string(5) "05:21"
    [5]=>
    string(6) "$0.37
"
  }
}

If you pass true, which gives you a JSON-encoded string, the above will look like:

[["droplet_name","125","01-01","00:00","01-06","05:21","$7.44\n"],["droplet_name","01-01","00:00","01-06","05:21","$0.37\n"]]

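The results detail, in order:

  • The name of the Droplet
  • Hours in use (only on the Droplet's own line; the Backup Services line has no hours column)
  • Date of Creation
  • Time of Creation
  • Date of Destruction
  • Time of Destruction
  • Total Cost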

...

Of course, the above is just one way of doing it without chaining grep and awk on the command line. You may not want to use PHP, and that's perfectly okay :-). This is just showing you how it could be done. The results could be processed further; for the purpose of this example, I chose to strip out anything inside (), so the size of the Droplet isn't included in the result set.
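If you'd rather stay in the shell, the same stripping of the parenthesised size can be approximated with sed, much like the preg_replace() call in the script. This is a rough sketch, assuming each line has the layout shown earlier. The helper name strip_size is made up, and turning the columns into CSV is just one possible output format:

```shell
#!/bin/sh
# strip_size is a hypothetical helper: removes the first "(...)" group from
# each line on stdin, trims the ends, then turns runs of spaces into commas.
strip_size() {
    sed -e 's/([^)]*)//' -e 's/^ *//' -e 's/ *$//' -e 's/  */,/g'
}

printf '%s\n' 'droplet-name (4GB)                        125     01-01 00:00   01-06 05:21   $7.44' | strip_size
# prints droplet-name,125,01-01,00:00,01-06,05:21,$7.44
```

In sed's default (basic) regex syntax, unescaped parentheses are literal characters, which is why they don't need a backslash here.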
