Patterns in Text

To prepare the minimum wage dataset, we must combine the data on state minimum wage rates with data on employment status, average annual pay and consumer prices.

The minimum wage data and the employment data are in CSV files like the ones in our discussion of reading and writing. The consumer price data and the average annual pay are plain text files and not so easy to work with.

Each text file contains several annual series, one after another. In the consumer price file, the consumer price index is followed by the 12-month percentage change. In the average annual pay file, the series follow each other alphabetically by state name.

And because the state name is listed above the data itself (not alongside it), we must capture the two pieces of information separately by identifying patterns in text. We must also edit some of the text.

To explain the pattern recognition and replacement tools used to assemble the minimum wage dataset, we will first review a simple example -- Perl's substitution operator, s///. Then, we will use pattern recognition to capture the data of interest in the plain text files and edit that data when necessary.

simple example

A simple pattern to recognize is a word. And a simple replacement is another word. So, for a simple example, we will first substitute one word with another word. Specifically, we will create the following list of people and their occupations, then we will change those occupations.

## array of people and occupations
my @people = (
    "Joseph is a college professor.",
    "Jennifer is a college professor.",
    "The waiter is a college professor."
);

Now suppose that all three college professors leave the classroom and become business owners. To update our array, we use s/// to replace "college professor" with "business owner":

## change occupations
s/college professor/business owner/ for @people ;

## print people and new occupations
print $_ . "\n" for @people ;

The lines above would print the following list:

Joseph is a business owner.
Jennifer is a business owner.
The waiter is a business owner.
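
The one-line form works because the for statement modifier makes $_ an alias for each element of @people in turn, and s/// operates on $_ by default, so the array elements themselves are changed. As a minimal sketch, the same substitution can be written as an explicit loop. The loop variable $person is a name introduced here for illustration:

```perl
use strict;
use warnings;

## array of people and occupations
my @people = (
    "Joseph is a college professor.",
    "Jennifer is a college professor.",
    "The waiter is a college professor."
);

## explicit loop: $person is an alias for each element of @people,
## so binding s/// to it with =~ edits the array in place
for my $person (@people) {
    $person =~ s/college professor/business owner/ ;
}

## print people and new occupations
print $_ . "\n" for @people ;
```

This sketch prints the same three lines as the one-line version.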

But now suppose that Joseph and Jennifer become business owners, while the waiter continues to supplement his income with college teaching. We need a pattern that identifies the people whose occupations need to be updated.

In this case, we might update the occupations of people who do not have a space in their names. "The waiter" has a space in his name. Joseph and Jennifer do not.

So here, we use the caret, ^, to anchor the match at the beginning of the string; we use parentheses, (), to store part of the matched text in the $1 variable; we use [A-Z] to find a capital letter; we use [a-z]+ to find a series of lower case letters (where the plus operator, +, indicates that the [a-z] pattern must occur one or more times); and we use \s to find a whitespace character, such as a space:

## substitution when there is no space in name
s/^([A-Z][a-z]+\s)is a college professor/$1is a business owner/ for @people ;

## print people and new occupations
print $_ . "\n" for @people ;

The lines above perform the substitution for Joseph and Jennifer, but not for "The waiter" (who has a space in his name), and print the following list:

Joseph is a business owner.
Jennifer is a business owner.
The waiter is a college professor.
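
To see how the pieces of that pattern fit together, the sketch below writes the same pattern with the x modifier, which lets whitespace and comments appear inside the pattern. It also relies on the fact that s/// returns the number of substitutions it made. The counter $changed and the loop variable $person are names introduced here for illustration:

```perl
use strict;
use warnings;

my @people = (
    "Joseph is a college professor.",
    "Jennifer is a college professor.",
    "The waiter is a college professor."
);

## count how many elements were changed
my $changed = 0 ;

for my $person (@people) {
    ## the substitution returns a true value only when it matched
    $changed++ if $person =~ s/
        ^                           ## anchor at the start of the string
        (                           ## begin the stored group
            [A-Z]                   ## one capital letter
            [a-z]+                  ## one or more lower case letters
            \s                      ## one whitespace character
        )                           ## end the stored group
        is\ a\ college\ professor   ## literal text, spaces escaped under x
    /$1is a business owner/x ;
}

## print people and new occupations, then the count
print $_ . "\n" for @people ;
print "changed: $changed\n" ;
```

Here the count is 2, because the pattern matches the lines for Joseph and Jennifer but not the line that begins with "The waiter".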

assembling the dataset

In the file containing the average annual pay data, the state name is listed above the data, so we must capture the state name separately from the state's average annual pay statistic. To hold the state name as the while loop reads through the file, we first declare a scalar to store that state name:

## remember which state we're examining
my $state_now = "" ;

Then, at each line of data, we determine whether it contains a state name or a statistic. If the line contains the state name, we edit the line to extract that name. If the line contains a statistic, we identify the year and value, remove the preliminary tag, (P), and store the information in the %indata hash:

## read in the data, skipping first eight lines
open( INAPAY , '<' , $inapay ) || die "could not open $inapay: $!" ;
<INAPAY> while $. < 8 ;
while (<INAPAY>) {
    chomp;
    my $line = $_ ;

    ## if the line contains state name, then store state name
    if ($line =~ /^State:\s+/) {

        ## state name is the line, after substituting out field identifier
        ( $state_now = $line ) =~ s/^State:\s+// ;
    }

    ## if the line begins with a year from 2000 to 2019, then store data
    if ($line =~ /^20[01][0-9],/ ) {

        ## split fields by comma
        my ($year,$value) = split( ',' , $line ) ;

        ## remove "(P)" preliminary tag
        $value =~ s/\(P\)// ;

        ## store data in hash
        $indata{$state_now}{$year}{"avg_annl_pay"} = $value ;
    }
}
close INAPAY ;
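
The loop above can be tried without the data file. The sketch below runs the same logic on a few lines held in a string. The layout of the sample (a "State:" line followed by year,value lines) is an assumption inferred from the patterns in the loop, and the values are made up for illustration, not real statistics. The names $sample and $fh are introduced here:

```perl
use strict;
use warnings;

## hypothetical sample in the assumed layout; values are made up
my $sample = <<'END_SAMPLE' ;
State: Alabama
1999,100
2000,110
2001,120(P)
State: New York
2000,210
2001,220(P)
END_SAMPLE

my %indata ;
my $state_now = "" ;

## open the string as an in-memory file
open( my $fh , '<' , \$sample ) || die "could not open sample: $!" ;
while ( my $line = <$fh> ) {
    chomp $line ;

    ## copy the line into $state_now, then strip the field identifier
    if ( $line =~ /^State:\s+/ ) {
        ( $state_now = $line ) =~ s/^State:\s+// ;
    }

    ## lines beginning with a year from 2000 to 2019 hold the data
    if ( $line =~ /^20[01][0-9],/ ) {
        my ($year,$value) = split( ',' , $line ) ;
        $value =~ s/\(P\)// ;
        $indata{$state_now}{$year}{"avg_annl_pay"} = $value ;
    }
}
close $fh ;

## print what was stored, sorted by state and year
for my $state ( sort keys %indata ) {
    for my $year ( sort keys %{ $indata{$state} } ) {
        print "$state,$year,$indata{$state}{$year}{avg_annl_pay}\n" ;
    }
}
```

With this sample, the 1999 line is skipped, the (P) tags are removed, and each value is stored under the state name that most recently appeared above it.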

More details on how I assembled the minimum wage dataset can be found in my Perl script. And in our discussion of regressions, we will explore the assembled dataset and attempt to measure the effect of the minimum wage on employment while controlling for the effects of other variables, like inflation and average annual pay.

Copyright © 2002-2024 Eryk Wdowiak