Patterns in Text
The minimum wage data and the employment data are in CSV files like the ones in our discussion of reading and writing. The consumer price data and the average annual pay are plain text files and not so easy to work with.
Each text file contains several annual series, one after another. In the consumer price file, the consumer price index is followed by its 12-month percentage change. In the average annual pay file, the series follow one another alphabetically by state name.
And because the state name is listed above the data itself (not alongside it), we must capture the two pieces of information separately by identifying patterns in the text. We must also edit some of the text.
To explain the pattern recognition and replacement tools used to assemble the minimum wage dataset, we will first review a simple example -- Perl's substitution operator, s///. Then, we will use pattern recognition to capture the data of interest in the plain text files and edit that data when necessary.
A simple pattern to recognize is a word, and a simple replacement is another word. So, for a first example, we will replace one word with another. Specifically, we will create the following list of people and their occupations, and then we will change those occupations.
## array of people and occupations
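A minimal sketch of such an array, assuming all three people start out as college professors. The names come from the text; the exact wording of each entry is an assumption:

```perl
use strict;
use warnings;

# Assumed entries: one sentence per person, all three teaching college.
my @people = (
    "Joseph is a college professor.",
    "Jennifer is a college professor.",
    "The waiter is a college professor.",
);

# Print one entry per line.
print "$_\n" for @people;
```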
Now suppose that all three college professors leave the classroom and become business owners. To update our array, we use s/// to substitute college professor with business owner:
## change occupations
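One way this substitution might be written, again assuming the three entries sketched from the text. Inside foreach, $_ is an alias for each array element, so s/// edits the array in place:

```perl
use strict;
use warnings;

# Assumed starting entries (same assumption as before).
my @people = (
    "Joseph is a college professor.",
    "Jennifer is a college professor.",
    "The waiter is a college professor.",
);

# s/PATTERN/REPLACEMENT/ operates on $_ by default;
# $_ aliases the current element, so the array itself changes.
foreach (@people) {
    s/college professor/business owner/;
}

print "$_\n" for @people;
```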
The lines above would print the following list:
Joseph is a business owner.
But now suppose that Joseph and Jennifer become business owners, while the waiter continues to supplement his income with college teaching. We need a pattern that identifies the people whose occupations need to be updated.
In this case, we might update the occupations of people who do not have a space in their names. "The waiter" has a space in his name. Joseph and Jennifer do not.
So here, we use the caret, ^, to anchor the match at the beginning of the string; we use parentheses, (), to store part of the match in the $1 variable; we use [A-Z] to find a capital letter; we use [a-z]+ to find a series of lowercase letters (where the plus quantifier, +, indicates that the [a-z] pattern must occur one or more times); and we use \s to find a whitespace character, such as a space:
## substitution when there is no space in name
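A sketch of how these pieces could fit together, under the same assumption about the entries. The exact pattern is a guess at one that behaves as described: it requires a single capitalized word followed immediately by "is a college professor", so an entry beginning "The waiter" fails to match, because the word after the first space is "waiter" rather than "is":

```perl
use strict;
use warnings;

# Assumed starting entries (same assumption as before).
my @people = (
    "Joseph is a college professor.",
    "Jennifer is a college professor.",
    "The waiter is a college professor.",
);

foreach (@people) {
    # ^              anchor at the start of the string
    # ([A-Z][a-z]+)  one capitalized word, stored in $1
    # \s             a single whitespace character
    # The rest of the pattern must follow the name directly.
    s/^([A-Z][a-z]+)\sis a college professor/$1 is a business owner/;
}

print "$_\n" for @people;
```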
The lines above perform the substitution for Joseph and Jennifer, but not for "the waiter" (who has a space in his name), and print the following list:
Joseph is a business owner.
Assembling the Dataset
In the file containing the average annual pay data, the state name is listed above the data, so we must capture the state name separately from the state's average annual pay statistic. To hold the state name as the while loop reads through the file, we first declare a scalar to store that state name:
## remember which state we're examining
Then, for each line of data, we determine whether it contains a state name or a statistic. If the line contains the state name, we edit the line to extract that name. If the line contains a statistic, we identify the year and value, remove any preliminary tag, (P), and pass the information to the hash:
## read in the data, skipping first eight lines
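The logic described above can be sketched as follows. The file layout here is hypothetical: eight header lines, then a "State: NAME" line above rows of "YEAR VALUE", with made-up numbers. The real file's labels and spacing may differ, and the sketch loops over an in-memory list with foreach rather than reading a filehandle with while, so that it runs on its own. The names %pay, $state and $n are likewise assumptions:

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the plain text file.
# All values are made up for illustration.
my @lines = (
    ("(header text)") x 8,
    "State: Alabama",
    "2001   30000",
    "2002   31000(P)",
    "State: Alaska",
    "2001   36000",
    "2002   37000(P)",
);

my %pay;      # $pay{state}{year} = average annual pay
my $state;    # remembers which state we're examining
my $n = 0;    # line counter

foreach my $line (@lines) {
    next if ++$n <= 8;                  # skip the first eight lines

    if ($line =~ /^State:/) {
        # Copy the line, then strip the label so only the name remains.
        ($state = $line) =~ s/^State:\s*//;
    }
    elsif ($line =~ /^(\d{4})\s+(\S+)/) {
        my ($year, $value) = ($1, $2);
        $value =~ s/\(P\)//;            # remove the preliminary tag
        $pay{$state}{$year} = $value;   # file the value under state and year
    }
}

# Print the assembled hash, sorted by state and year.
foreach my $s (sort keys %pay) {
    foreach my $y (sort keys %{ $pay{$s} }) {
        print "$s $y $pay{$s}{$y}\n";
    }
}
```

Keeping $state outside the loop is what lets each statistic line be filed under the most recently seen state name.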
More details on how I assembled the minimum wage dataset can be found in my Perl script. And in our discussion of regressions, we will explore the assembled dataset and attempt to measure the effect of the minimum wage on employment while controlling for the effects of other variables, like inflation and average annual pay.