Tutorial: Using Patterns to Describe Input Data

(Note: Anywhere you see something like send a "foo" command to the Musi-Cal command processor it means to send an email message to concerts@musi-cal.com with a message body of "foo".)

Tour itineraries most commonly take the form of multiple lines, where each line represents another stop on the tour. The default and convert commands of the Musi-Cal command processor convert data in this form into add commands that can be used to add new entries to the database.

If you do not already have a basic understanding of adding (posting) dates to Musi-Cal, it would be good to acquire one before learning about patterns. You can read the tutorial on submitting events to Musi-Cal or send a "help submit" command to the Musi-Cal command processor.

If you're only posting an occasional entry, stop here. You will find it simpler to just type out each entry as its own add command. If you plan to submit frequent and lengthy itineraries, however, the time spent acquainting yourself with the converter will be well worth the investment. Read on.

After you've read this, if it seems like more work than you are ready for, you might consider letting us do the hard stuff. Read about itinerary@musi-cal.com. You can fetch an email version of that tutorial by sending a "help itinerary" command to the Musi-Cal command processor.

This guide focuses on the most important part (and the hardest part to master) of the conversion process: creating patterns that describe the input lines.

Basic Patterns

When converting raw data that lists each event on a single line of your itinerary, you must describe to the processor what pattern it will be reading. The default command and pattern commands do most of the work. Most of the default commands do exactly what you'd expect: they dictate the information that will be standard for each line of your entries, allowing the pattern field to interpret the lines with variable data. Each part of the pattern you will create describes one of four items:

In other words, every letter, digit, character, or space in a line of your raw data will need to be introduced to the command processor by way of the pattern you create. It's easier than it sounds!

Here's a look at what a default/convert entry might look like. (Again, if some of these commands are unfamiliar, it would be worthwhile to spend time with the "help submit" document mentioned earlier.)

    default
    docaps 
    performer Olsen, Kristina
    keyword acoustic, contemporary, folk, blues
    dsyear 1995
    pattern "%{smonth}/%{sday}/%{syr}","%{venue}","%{city}","%{st}","%{time}","%{info}",
    end

    convert
    "8/19/95","WILDCAT RANCH","NAVARRE","OH","8PM","216-555-1212",
    "8/31/95","STRAWBERRY FESTIVAL","YOSEMITE","CA","7PM","888-555-1212",
    "9/17/95","MILL POND FEST","BISHOP","CA","4pm","800-555-1212",
    "10/28/95","NEW FOLK COLLECTIVE","ST. PAUL ","MN","8PM","612-555-1212",
    end

The patterns used by the Musi-Cal command processor can be used to match many different types of input. In the example above, the commands between default and the first end describe the pattern of the raw data that is between convert and the second end (which in this example was exported from a FileMaker Pro database, thus the extra quotation marks). More on this later. In a simpler example, your itinerary may reside on your computer like this:

    Sat     6/24   Greenwich, CT           Roger Sherman Baldwin Park
This line represents a concert that will take place on Saturday, June 24th in Greenwich, Connecticut at the Roger Sherman Baldwin Park. Lines of this form can be described by the pattern
    %{alpha} %{smonth}/%{sday} %{city}, %{st} %{venue}

(Note that in both of these examples all of the raw data refers to the itinerary of one artist, and therefore the default commands refer to the particulars of that artist.)

The %{alpha} tag matches a string of letters, in this case "Sat", the day of the week. %{alpha} is an example of a match for a field whose value we will ignore. The space after %{alpha} matches one or more space or tab characters in the line. The %{smonth} tag matches a number that corresponds to the numeric month of the year for the starting date of the event. The '/' character matches a literal '/' in the input. The %{sday} tag matches a number that corresponds to the day of the month for the starting date of the event. Following that is more white space. The %{city} tag matches an arbitrary number of characters of any type. The comma after the %{city} tag tells the matcher to stop matching characters when a comma is found. This implies that commas may not appear in the city name. More white space follows the comma. The %{st} tag matches one or more letters that represent an abbreviation of a US state or a Canadian province. That is again followed by white space. The final %{venue} tag, like the %{city} tag, can match any characters. Since there is nothing else following it, the %{venue} tag will match the remainder of the line.
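
If it helps to see the matching rules in a more familiar form, here is a rough Python sketch that translates the pattern above into a regular expression. It is only an illustration: the names TAG_REGEX, pattern_to_regex and literal_to_regex are made up for this example, the character classes are assumptions based on the tag list in this tutorial, and the real command processor may well work differently.

```python
import re

# Assumed character classes for a few tags, based on the tag list in this
# tutorial; the real processor may define them differently.
TAG_REGEX = {
    "alpha": r"[A-Za-z]+",
    "smonth": r"\d+",
    "sday": r"\d+",
    "st": r"[A-Za-z]+",
    "city": r".+?",    # "anything": stops at the delimiter that follows
    "venue": r".+?",   # "anything": runs to the end when nothing follows
}


def literal_to_regex(text):
    """Escape literal characters; a run of white space matches spaces or tabs."""
    return "".join(
        r"[ \t]+" if chunk.isspace() else re.escape(chunk)
        for chunk in re.split(r"(\s+)", text)
        if chunk
    )


def pattern_to_regex(pattern):
    """Translate a Musi-Cal style pattern into a compiled regular expression."""
    parts = []
    pos = 0
    for tag in re.finditer(r"%\{(\w+)\}", pattern):
        parts.append(literal_to_regex(pattern[pos:tag.start()]))
        name = tag.group(1)
        parts.append("(?P<%s>%s)" % (name, TAG_REGEX[name]))
        pos = tag.end()
    parts.append(literal_to_regex(pattern[pos:]))
    return re.compile("".join(parts) + r"\s*$")


line = "Sat     6/24   Greenwich, CT           Roger Sherman Baldwin Park"
regex = pattern_to_regex("%{alpha} %{smonth}/%{sday} %{city}, %{st} %{venue}")
fields = regex.match(line).groupdict()
print(fields)
```

In this sketch the city comes back as "Greenwich", the state as "CT" and the venue as "Roger Sherman Baldwin Park", while the "Sat" captured by alpha would simply be thrown away.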

In the patterns you will be creating, the "tags" (items between the curly brackets) are the labels the program uses to interpret certain standardized elements of raw data. Each tag has some predetermined parameters: for instance, it may tell the converter to ignore the data in that position (as with an %{alpha} tag), or to read numerals only, letters only, and so on. If you stick to the pre-assigned tags in the list below, you will not need to concern yourself with this.

The following is a list of {tags}:

    tag         description                     characters
    ----------------------------------------------------------------------
    performer   performer name (in sorting      anything
                order!)
    type        music type (rock, jazz,         letters, comma, spaces, &
                folk, classical, etc)
    keywords    other descriptors (symphonic,   letters, comma, spaces, &
                early music, a cappella, etc)
    city        city of the event               anything
    state       state name spelled out          letters, spaces
    province    province name spelled out       letters, spaces
    country     country spelled out             letters, spaces
    st          abbreviated state name -        letters
                always printed upper case
    prov        abbreviated province name -     letters
                always printed upper case
    cty         abbreviated country name -      letters
                always printed upper case
    venue       venue name                      anything
    info        concert information             anything
    program     concert program                 anything
    sday        day of the month for the        digits
                start of the event
    smonth      numeric month (1..12) for       digits
                the start of the event
    syear       four-digit year of the          digits
                start (e.g., '1995')
    syr         two-digit year of the           digits
                start (e.g., '95')
    Smon        abbreviated month name for      letters
                the start (e.g., Jan)
    Smonth      full month name for the         letters
                start (e.g., January)
    eday        day of the month for the        digits
                end of the event
    emonth      numeric month (1..12) for       digits
                the end of the event
    eyear       four-digit year of the end      digits
                (e.g., '1995')
    eyr         two-digit year of the end       digits
                (e.g., '95')
    days        dates or date ranges,           digits, comma, spaces
                separated by commas (e.g.,
                '21, 22, 25-26')
    Emon        abbreviated month name for      letters
                the end (e.g., Jan)
    Emonth      full month name for the         letters
                end (e.g., January)
    number      one or more digits with no      digits
                intervening white space
    alpha       one or more letters with        letters
                no intervening white space
    string      one or more letters,            anything
                digits or punctuation
                characters with possible
                embedded white space
    date        "Smart" date parsing            anything
    location    "Smart" address/city/state      anything
                parsing

If you look back at the pattern examples above you will see that not only do the patterns describe the obvious tag elements but they also must mimic the presence of punctuation marks and spaces. Stay with us! This is truly only difficult the first few times!

Properly Delimiting Fields

Learning to create correct patterns takes a little practice. Many tags will match any input up until the delimiter character that immediately follows the tag. For instance, city names can contain just about anything. In particular, they can contain spaces. The input

    New York NY Shea Stadium
is easy for humans to break into its three pieces (because we apply meaning to the words as we are matching patterns), but impossible for the pattern matcher. The pattern
    %{city} %{st} %{venue}
might seem like it should work. However, since city names can contain just about anything, that tag is terminated by the first character that follows the tag, in this case, a space. Therefore, %{city} would only match "New". The pattern matcher knows that a state abbreviation as represented by the %{st} tag can only contain letters, so %{st} would match the next word, "York". Finally, since the %{venue} field can (like the %{city} tag) contain anything, it matches the rest of the line, "NY Shea Stadium".

The solution is to properly terminate those tags that can contain arbitrary characters if they do not represent the last field in the input. Adding a comma after the city makes the above input and pattern correct:

    New York, NY Shea Stadium
and
    %{city}, %{st} %{venue}

The tags that can match any characters and must be delimited by something other than white space are:

%{performer} performer name
%{city} city name
%{venue} venue name
%{program} concert program information
%{info} other concert information

Some other tags can contain white space, so should not be terminated by spaces:

%{keywords} other keywords describing the music
%{state} US state name spelled out
%{province} Canadian province name spelled out
%{country} other country name spelled out
%{days} range of days

Returning again to our Shea Stadium example, if you have the input

    New York, New York Shea Stadium
the pattern
    %{city}, %{state} %{venue}
will not match properly. The %{state} tag will match all but the last word on the line (sort of a bone the pattern matcher throws to the %{venue} tag - it tries very hard to make matches), leaving a state called "New York Shea", which does not match a state known to the later stages when meaning is applied to the matches.
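
The three Shea Stadium cases above can be imitated with ordinary regular expressions. The Python sketch below is an approximation only: the names ANYTHING, LETTERS, LETTERS_SPACES, WS and split_fields are invented for the example, and we are assuming that "anything" tags stop at the first delimiter while %{state} grabs as many words as it can.

```python
import re

ANYTHING = r".+?"               # tags such as %{city} and %{venue}
LETTERS = r"[A-Za-z]+"          # tags such as %{st}
LETTERS_SPACES = r"[A-Za-z ]+"  # tags such as %{state}
WS = r"[ \t]+"                  # a space in the pattern


def split_fields(regex, line):
    """Return the matched fields as a tuple, or None if the line fails."""
    m = re.fullmatch(regex, line)
    return m.groups() if m else None


# %{city} %{st} %{venue}
no_comma = "(%s)%s(%s)%s(%s)" % (ANYTHING, WS, LETTERS, WS, ANYTHING)
# %{city}, %{st} %{venue}
with_comma = "(%s),%s(%s)%s(%s)" % (ANYTHING, WS, LETTERS, WS, ANYTHING)
# %{city}, %{state} %{venue}
full_state = "(%s),%s(%s)%s(%s)" % (ANYTHING, WS, LETTERS_SPACES, WS, ANYTHING)

print(split_fields(no_comma, "New York NY Shea Stadium"))
print(split_fields(with_comma, "New York, NY Shea Stadium"))
print(split_fields(full_state, "New York, New York Shea Stadium"))
```

The first call splits the line at the wrong places ("New", "York", "NY Shea Stadium"), the second gives the split we wanted, and the third leaves a state of "New York Shea" with only "Stadium" for the venue.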

Optional Elements

You can create patterns that contain optional elements. This allows you to match a broader class of input lines with a single pattern than would otherwise be possible. For instance, suppose you have the following raw input you want to convert:

    Sat     6/24   Greenwich, CT           Roger Sherman Baldwin Park
    Tue     6/27   Toronto, ON            Ultrasound Show Bar
    Thu-Fri 6/29-30 Lancaster, PA           Chameleon Club
The pattern described earlier will match the first two lines, but not the third. It will fail on two counts. First, the %{alpha} tag does not match the string "Thu-Fri". Second, the %{sday} tag does not match the string "29-30". We can handle these lines in a number of different ways. First we will describe the use of optional tags.

An optional tag looks like a normal tag except the first character after the opening curly brace is a question mark. For instance, the don't-care string "Thu-Fri" can be matched by the tag string "%{alpha}%{?-%{alpha}}". This introduces three new concepts that greatly increase the flexibility of the pattern matcher:

  1. The use of the "?" to introduce an optional tag
  2. The use of literal characters within a tag
  3. The use of nested tags

The tag %{?-%{alpha}} matches an optional hyphen character followed by one or more alphabetic characters. Note that if the hyphen occurs, the alphabetic character(s) must also occur.

To match the starting day and an optional ending day separated by a hyphen, you can use the tags "%{sday}%{?-%{eday}}". This will match both the "24" and "29-30" day strings in the example input.
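
For readers who know regular expressions, an optional tag behaves much like an optional group. The Python sketch below shows the idea for the start of each input line; the names DAY_AND_DATE, match_day_and_date and alpha2 are hypothetical, and the regular expression is our own approximation rather than what the command processor actually runs.

```python
import re

# Rough equivalent of: %{alpha}%{?-%{alpha}} %{smonth}/%{sday}%{?-%{eday}}
# Each %{?...} becomes an optional non-capturing group "(?:...)?".
DAY_AND_DATE = re.compile(
    r"(?P<alpha>[A-Za-z]+)(?:-(?P<alpha2>[A-Za-z]+))?"
    r"[ \t]+"
    r"(?P<smonth>\d+)/(?P<sday>\d+)(?:-(?P<eday>\d+))?"
)


def match_day_and_date(line):
    """Match the weekday and date at the start of a line, or return None."""
    m = DAY_AND_DATE.match(line)
    return m.groupdict() if m else None


print(match_day_and_date("Sat     6/24   Greenwich, CT"))
print(match_day_and_date("Thu-Fri 6/29-30 Lancaster, PA"))
```

When the optional part is absent, as in the first line, the groups inside it are simply left empty (None in Python), which mirrors the rule that the hyphen and what follows it must appear together or not at all.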

Splitting Input

In some situations it is difficult or impossible to use optional tags to create one pattern that matches a range of different inputs. In that case it is best to divide the input into two or more smaller sets and describe simpler patterns that handle each case. This approach could have been used above where we used optional elements. As another example, suppose we have the following two lines of input:

    6/24   Greenwich, CT           Roger Sherman Baldwin Park
    7/5    London, ENG             Wembley Stadium

The number of fields in both lines is the same, but the fourth field in the first is a US state abbreviation while the fourth field in the second is an abbreviation for England. The pattern matcher can't tell the difference between a state abbreviation and a country abbreviation, so distinguishing the two is not possible. (At the early stage where the pattern matching is done, nothing is known about the meaning of the patterns being matched. Both state and country abbreviations are represented by one or more alphabetic characters. It is only later after the pattern has been matched successfully that meaning is given to each piece of input.)

Simply divide the input into two parts and generate two separate patterns,

    %{smonth}/%{sday} %{city}, %{st} %{venue}
for the first, and
    %{smonth}/%{sday} %{city}, %{cty} %{venue}
for the second. (Note: The above example is probably not a good one any more. During the later stage of conversion, whatever is matched by the cty tag will be considered as a country abbreviation, and if that fails, as a US state/Canadian province abbreviation, so the example above should be handled properly by the second pattern.)

A Simpler Way to Handle Day Ranges

The use of commas and hyphens to separate events that occur repeatedly is so common that a special tag, %{days}, was created to handle day ranges like:

    2,3,4
    2-4
    2,3,5-7
    7,14,21

This is a third alternative for handling part of the earlier pattern that contained a day range. Instead of using

    %{sday}%{?-%{eday}}
to match both the individual dates as well as the two-day gig, we could have simply used
    %{days}

The %{days} tag is quite flexible. If you give a day range of "2,3,4" it will generate a single event instead of three separate ones. On the other hand, the day range "7,14,21" will cause three separate entries to be generated, since the days are not consecutive. The %{days} tag will not handle day ranges that are not in strictly ascending order, however. "4,3,2" or "1,2,2,3" will both break things.
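
To make the grouping rule concrete, here is a small Python sketch of how a day string might be turned into events. The function name expand_days is hypothetical, and the behavior is inferred from the description above (consecutive days merge into one event, anything out of order is rejected); the real converter may differ in its details.

```python
def expand_days(spec):
    """Turn a %{days}-style string into a list of (first_day, last_day) runs."""
    days = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = (int(x) for x in part.split("-"))
            days.extend(range(first, last + 1))
        else:
            days.append(int(part))

    # Assumed check: the tutorial says non-ascending input "breaks things".
    if any(b <= a for a, b in zip(days, days[1:])):
        raise ValueError("days must be in strictly ascending order: %r" % spec)

    # Merge consecutive days into a single run (one event per run).
    runs = []
    for day in days:
        if runs and day == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], day)
        else:
            runs.append((day, day))
    return runs


for spec in ["2,3,4", "2-4", "2,3,5-7", "7,14,21"]:
    print(spec, "->", expand_days(spec))
```

Under these assumptions "2,3,4" and "2-4" both produce one event running from the 2nd to the 4th, "2,3,5-7" produces two events, and "7,14,21" produces three single-day events.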

Recommendations

The creators of Musi-Cal have used the default and convert commands to convert many tour itineraries into sets of add commands for entry into the database. Our task is complicated by the fact that we have to convert data from lots of people, so we are constantly creating new patterns. Most people submitting data to Musi-Cal will normally just add information for the same group of artists, however. This makes your task easier. If you can settle on a single format that works for you and for which you can generate a pattern, so much the better. Simply save that pattern and use it repeatedly to convert new itineraries you receive.

Many music industry professionals work with tour information in databases. They can often generate plain ASCII output that is ugly to read, but very easy to generate patterns for. If your database system can generate output that is quoted and separated by commas:

    "7/1/95","HIGH SIERRA MUSIC FESTIVAL","BEAR VALLEY","CA","1-510-420-1529"
you're home free. If it generates tab-delimited output, simply replace the tab characters by some other non-white-space character that doesn't occur in the data. '~', '%', '^' or '=' are reasonable candidates to replace tabs.
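
If you would rather not do the tab replacement by hand, a few lines of Python will do it. This is just a convenience sketch: the function name retab is made up, and the candidate characters are the ones suggested above.

```python
def retab(lines, candidates="~%^="):
    """Replace tabs with the first candidate character absent from the data."""
    text = "\n".join(lines)
    for delim in candidates:
        if delim not in text:
            return delim, [line.replace("\t", delim) for line in lines]
    raise ValueError("every candidate delimiter already occurs in the data")


delim, converted = retab(["7/1/95\tHIGH SIERRA MUSIC FESTIVAL\tBEAR VALLEY\tCA"])
print(delim)
print(converted[0])
```

The returned delimiter is the character to use between the tags in your pattern, so with '~' a pattern along the lines of %{smonth}/%{sday}/%{syr}~%{venue}~%{city}~%{st} would presumably fit the converted line.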

Some mail systems love to break messages into lines that don't contain more than 70 or 75 characters (AOL's mail system appears to do this). This can be frustrating to people trying to use the conversion commands, since patterns can get long. If you have a long pattern, simply split it onto two (or more) lines and terminate all but the last line with backslashes (\), e.g.:

    pattern "%{smonth}/%{sday}/%{syr}","%{venue}",\
    "%{city}","%{st}",%{?"%{info}"}
When the lines are first read in, if a line ends in a backslash, the next line is simply appended to make one longer line. No white space is inserted, so you can break the line anywhere you please.
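
The joining rule is simple enough to sketch in a few lines of Python. The function name join_continued is hypothetical, and this only mirrors the description above, not the processor's actual code.

```python
def join_continued(lines):
    """Join each line ending in a backslash with the line that follows it."""
    joined = []
    pending = ""
    for line in lines:
        if line.endswith("\\"):
            pending += line[:-1]   # drop the backslash, insert no white space
        else:
            joined.append(pending + line)
            pending = ""
    if pending:
        joined.append(pending)     # a trailing backslash on the final line
    return joined


message = [
    'pattern "%{smonth}/%{sday}/%{syr}","%{venue}",\\',
    '"%{city}","%{st}",%{?"%{info}"}',
    "end",
]
for line in join_continued(message):
    print(line)
```

Because nothing is inserted at the join, the two halves of the pattern come back together exactly as if they had been typed on one line.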

You can now continue many input lines without a backslash. If you are in a context where a keyword is being expected (such as within an add, default, get or edit command), you can usually just continue the field you are entering on the next line. If the first word of the current line is not recognized as a keyword in that context, the line is appended to the previous line, separated by a single space. There are three situations where you must still use a backslash to continue a line:

  1. If it is required that no space be inserted at the place where the line break occurs, you must still use a backslash to terminate the line (as in the pattern field example above).
  2. If the first word of the continuation line is a Musi-Cal keyword in the context in which it occurs, you must still use a backslash to terminate the line. For instance, in the following contrived add command, a backslash must be used to continue the info field:
          add
             performer Clampett,Daisy Mae
             info For more \
             info, contact Jethro at (213)555-1212
          end
    
    because the continuation line begins with the word "info", which is a keyword in that context.
  3. You must always continue raw input lines inside the convert command with a backslash. The convert command recognizes no keywords other than the isolated word "end" on a line by itself.
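
The keyword rule can be pictured with the following Python sketch. Everything in it is an assumption made for illustration: ADD_KEYWORDS is a hypothetical subset of the keywords valid inside an add command, join_fields is an invented name, and we are guessing that the first word is compared without its trailing punctuation.

```python
import re

# Hypothetical subset of the keywords recognized inside an add command.
ADD_KEYWORDS = {"performer", "keywords", "city", "venue", "info", "date", "end"}


def join_fields(lines, keywords=ADD_KEYWORDS):
    """Append lines that do not start with a keyword to the previous field."""
    fields = []
    for line in lines:
        if not line.strip():
            continue
        first = re.match(r"\s*([A-Za-z]+)", line)
        starts_field = first is not None and first.group(1) in keywords
        if starts_field or not fields:
            fields.append(line.strip())
        else:
            fields[-1] += " " + line.strip()   # joined with a single space
    return fields


print(join_fields([
    "performer Clampett,Daisy Mae",
    "info For more details,",
    "   contact Jethro at (213)555-1212",
    "end",
]))
print(join_fields(["info For more", "info, contact Jethro at (213)555-1212"]))
```

The first call folds the "contact Jethro" line into the info field. The second shows why the contrived example needs a backslash: the continuation starts with "info", so a keyword-based joiner like this one treats it as the start of a new field.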

Now you should give it a try. Craft your first pattern. Include it in a default command entry, throw in some raw data after the convert command, and send it all off to concerts@musi-cal.com. If you are successful the first time, Musi-Cal will return a message to you that looks something like this. (A response email may take up to an hour, and may look a tad frightening when it does arrive!)

# The following output should be reviewed and corrected if necessary before
# sending back to concerts@musi-cal.com

# Lines that are not followed immediately by a submission comment or warning
# did not match the pattern.
# default
#   types 
#   pattern %{smonth}/%{sday}/%{syr} %{city}, %{country},%{? USA,} %{venue}
#   keywords .genodelafose-frenchrockinboogie.224
#   performers Delafose,Geno/French Rockin' Boogie
#   info 
# end
# <HTML><HEAD>
# <TITLE>Rosebud/Geno Delafose/Itinerary-Text</TITLE></HEAD><BODY>
# <H3>Geno Delafose & The French Rockin' Boogie Itinerary</H3>
# Updated on October 23, 1996<P>
# 10/24/96   Opelousas, LA, USA, Yambilee Building
### Date passed - not generating add command...
# 10/26/96   Lafayette, LA, USA, Hamilton's Club
  # Above line(s) converted cleanly and submitted
# 10/30/96   Istanbul, TURKEY, Efes Pilsen Blues Festival
  # Above line(s) converted cleanly and submitted
# 10/31/96   Istanbul, TURKEY, Efes Pilsen Blues Festival
  # Above line(s) converted cleanly and submitted
# 11/1/96   Istanbul, TURKEY, Efes Pilsen Blues Festival
  # Above line(s) converted cleanly and submitted
# 11/2/96   Istanbul, TURKEY, Efes Pilsen Blues Festival
  # Above line(s) converted cleanly and submitted
# 11/4/96   Izmir, TURKEY, Efes Pilsen Blues Festival
add
    performer Delafose,Geno/French Rockin' Boogie
    keywords .genodelafose-frenchrockinboogie.224
    city Izmir
    country Turkey
    ### Warning: Can't find lat/long for Izmir, .  Double-check the spelling.
    venue Efes Pilsen Blues Festival
    date 4 November 1996
end
clear
...

What you do with this mess is simple: glean and copy all the "add" through "clear" paragraphs and send them in a new piece of email back to concerts@musi-cal.com. You can often do this by simply replying to the message, though you should be careful to check that the reply will be sent to concerts@musi-cal.com instead of concertmaster@musi-cal.com. The most common problem is that we lack latitude/longitude information for many non-US cities, so you'll receive a warning like the one above for Izmir, Turkey. Another possible problem is that some of your fields are missing data, which will cause the pattern matcher to not recognize a line it should have matched.
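
If you have many entries to glean, a short script can pull out the "add" through "clear" paragraphs for you. The Python sketch below assumes the reply looks like the sample above, with "add" and "clear" each alone on a line; the function name extract_add_commands is our own invention. You should still read over the result (warnings included) before mailing it back.

```python
def extract_add_commands(reply_text):
    """Collect every block that runs from an "add" line to a "clear" line."""
    blocks = []
    current = None
    for line in reply_text.splitlines():
        word = line.strip()
        if current is None:
            if word == "add":
                current = [line]          # start of a new add command
        else:
            current.append(line)
            if word == "clear":
                blocks.append("\n".join(current))
                current = None
    return blocks


reply = """# 11/4/96   Izmir, TURKEY, Efes Pilsen Blues Festival
add
    performer Delafose,Geno/French Rockin' Boogie
    city Izmir
    date 4 November 1996
end
clear
# some later comment line
"""
for block in extract_add_commands(reply):
    print(block)
```

Comment lines outside the add commands are dropped, and each extracted block keeps its own "end" and "clear" lines so it can be sent back as is.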

Send off your newly sliced, diced, coleslawed, gig-o-matic results and go out for an ice cream sundae! Musi-Cal will return to you a happy little message that looks something like this:

Successfully added entry for Olsen, Kristina on July 1 1995.
Successfully added entry for Olsen, Kristina on July 7 1995.
Successfully added entry for Olsen, Kristina on July 8 1995.
.....etc

More than likely, on your first few tries you will make mistakes defining the pattern and some or all of your lines won't match it. If some lines are matched and others aren't, very carefully compare one that matches with one that doesn't. You'll probably discover a missing punctuation mark or white space in the non-matching line.

Once you've got this converting process down, you're ready to learn about editing the entries you submit. You can read the Web version of the tutorial on editing entries or send a "help edit" command to the Musi-Cal command processor.

Copyright © 2007 Wolfgang's Vault