Friday, October 11, 2013

Launching a Cascading job from Apache Oozie

The Cascading framework has its own workflow management system embedded in it, so when I tried to find information online about how to launch a Cascading job from within the Apache Oozie workflow scheduler tool, I found a dearth of information.

In fact, when I asked on the oozie-users mailing list how to do it, the only response I got back was to write an Oozie extension to run Cascading jobs. That may be the right solution long term (don't know enough yet), but I did find a way to get it working with what Oozie provides today.


/*---[ Failed attempts ]---*/

I tried unsuccessfully to use the map-reduce action and the shell action. The former won't work because it wants you to specify the Mapper and Reducer classes explicitly. That doesn't make sense in a Cascading job - you launch your main Cascading class and it auto-generates a bunch of mappers and reducers. Secondly, while you can use the oozie.launcher.action.main.class property and specify your main Cascading class, there seems to be no way to pass arguments to it.

I'm not sure why I couldn't get the shell action to work. I made the exec property /usr/bin/hadoop in order to run it as hadoop jar myjar.jar com.mycompany.MyClass arg1 arg2 argN, but several attempts to make that work failed. There probably is a way to make it work, however.


/*---[ Solution: use the java action ]---*/

In order to launch Cascading jobs, we build an uber-jar (which maven annoyingly calls a shaded jar) that has our specific Cascading code and supporting objects, as well as the Cascading library all bundled in it. But that's not enough as all that depends on the myriad Hadoop jars. We then use the hadoop jar invocation as I indicated above because it puts all the Hadoop jars in the classpath.

I didn't think using the Oozie java action would work unless I built a massive uber jar with all the Hadoop dependencies which then have to get farmed around the Hadoop cluster each time you run it -- a great waste.

But I was happily surprised to notice that Oozie sets up the classpath for java (and map-reduce) tasks with all the Hadoop jars present.

So, here's the workflow.xml file that works:

<workflow-app xmlns='uri:oozie:workflow:0.2' name='cascading-wf'>
  <start to='stage1' />
  <action name='stage1'>
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>

      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>${queueName}</value>
        </property>
      </configuration>

      <main-class>com.mycompany.MyCascade</main-class>
      <java-opts></java-opts>
      <arg>/user/myuser/dir1/dir2</arg>
      <arg>my-arg-2</arg>
      <arg>my-arg-3</arg>
      <file>lib/${EXEC}#${EXEC}</file> 
      <capture-output />
    </java>
    <ok to="end" />
    <error to="fail" />
  </action>


  <kill name="fail">
    <message>FAIL: Oh, the huge manatee!</message>
  </kill>

  <end name="end"/>
</workflow-app>

The parameterized variables, such as ${EXEC}, are defined in a job.properties in the same directory as the workflow.xml file. The shaded jar is in a lib subdirectory as indicated.

  
 nameNode=hdfs://10.230.138.159:8020
 jobTracker=http://10.230.138.159:50300
  
 queueName=default
  
 oozie.wf.application.path=${nameNode}/user/${user.name}/examples/apps/cascading
 EXEC=mybig-shaded-0.0.1-SNAPSHOT.jar

Let me know if you find another way to launch a Cascading job from Oozie or find any problems with this solution.

Tuesday, October 8, 2013

Beautiful Code Ported to Go

This week I've been learning Go - the programming language, not the game. I had studied its concurrency primitives for my Clojure library that would bring the CSP model to Clojure (this was before Rich Hickey and crew created core.async), but until a few days ago I hadn't formally studied the whole of Go with the intention of being proficient in it.

Go has pointers, but it does not have pointer arithmetic. Instead, it has slices - variable sized arrays on which you can use Python-like "slice" notation. I wanted a chance to try that out and found it recently when reading Chapter 1 of Beautiful Code, which is about a limited regular expression that Rob Pike (co-creator of Go) wrote for the Practice of Programming book he co-wrote with Brian Kernighan (who is also the author of Ch. 1 of Beautiful Code).

Pike's code is a limited (pedagogical) regex library that allows the following notation:

|------------+------------------------------------------------------------|
| Character  | Meaning                                                    |
|------------+------------------------------------------------------------|
| c          | Matches any literal character c                            |
| . (period) | Matches any single character                               |
| ^          | Matches the beginning of the input string.                 |
| $          | Matches the end of the input string                        |
| *          | Matches zero or more occurrences of the previous character |
|------------+------------------------------------------------------------|


/* ---[ The C version ]--- */

Here's Pike's code in C:


 #include <stdio.h>

 int matchstar(int c, char *regexp, char *text);

 /* matchhere: search for regexp at beginning of text */
 int matchhere(char *regexp, char *text) {
   if (regexp[0] == '\0')
     return 1;
   if (regexp[1] == '*')
     return matchstar(regexp[0], regexp+2, text);
   if (regexp[0] == '$' && regexp[1] == '\0')
     return *text == '\0';
   if (*text!='\0' && (regexp[0]=='.' || regexp[0]==*text))
     return matchhere(regexp+1, text+1);
   return 0;
 }


 /* matchstar: search for c*regexp at beginning of text */
 int matchstar(int c, char *regexp, char *text) {
   do {
     /* a * matches zero or more instances */
     if (matchhere(regexp, text))
       return 1;
   } while (*text != '\0' && (*text++ == c || c == '.'));
   return 0;
 }


 /* match: search for regexp anywhere in text */
 int match(char *regexp, char *text) {
   if (regexp[0] == '^')
     return matchhere(regexp+1, text);
   do {
     /* must look even if string is empty */
     if (matchhere(regexp, text))
       return 1;
   } while (*text++ != '\0');
   return 0;
 }

I'll let you read Ch. 1 of Beautiful Code for an analysis, but two things are noteworthy for my purposes:

  1. Pike uses pointer arithmetic throughout the code
  2. He uses the unusual do-while loop twice in only 30 or so lines of code

So I thought I'd port it to Pike's new language Go.


/* ---[ My Go version ]--- */


 package pikeregex

 // search for c*regex at beginning of text
 func matchstar(c rune, regex []rune, text []rune) bool {
     for {
         if matchhere(regex, text) {
             return true
         }
         if ! (len(text) > 0 && (text[0] == c || c == '.')) {
             return false
         }
         text = text[1:]
     }
 }

 // search for regex at beginning of text
 func matchhere(regex []rune, text []rune) bool {
     if len(regex) == 0 {
         return true
     }
     if len(regex) > 1 && regex[1] == '*' {
         return matchstar(regex[0], regex[2:], text)
     }
     if regex[0] == '$' && len(regex) == 1 {
         return len(text) == 0
     }
     if len(text) > 0  && (regex[0] == '.' || regex[0] == text[0]) {
         return matchhere(regex[1:], text[1:])
     }
     return false
 }

 // search for regex anywhere in the text
 func Match(regex string, text string) bool {
     runerx := compile(regex)
     runetxt := []rune(text)

     if len(runerx) > 0 && runerx[0] == '^' {
         return matchhere(runerx[1:], runetxt)
     }

     for {
         if matchhere(runerx, runetxt) {
             return true
         }
         if len(runetxt) == 0 {
             return false
         }
         runetxt = runetxt[1:]
     }
 }

 // one enhancement: allow + (1 or more) notation
 func compile(regex string) (regslc []rune) {
     regslc = make([]rune, 0, len(regex) + 10)

     for _, r := range regex {
         if r == '+' {
             regslc = append(regslc, regslc[len(regslc) - 1], '*')
         } else {
             regslc = append(regslc, r)
         }
     }   
     return regslc
 }

This is as straight a port as I could make it. And I think it translates well to Go. I've capitalized the Match method, as that is the public one to be exported to other libraries.

Instead of pointer arithmetic I used slice notation, as in this recursive call to matchhere:


 // C version
 return matchhere(regexp+1, text+1);

 // Go version
 return matchhere(regex[1:], text[1:])

Also in the C code you check whether you are at the end of the text string by looking for the NUL char: *text == '\0'. In Go, you can use the builtin len function: len(text) == 0. That statement is true if you keep recursively slicing text[1:] until you get to an empty string, or rather in my code, an empty slice of runes.


/* ---[ Runeology ]--- */

Runes are the Go 'char' type. A rune is an integer value identifying a Unicode code point. When you iterate over strings, you get runes, which are of variable size (number of bytes).

You have to be careful with strings in Go: text[2] returns the third byte, not the third rune in the string. If you want the third rune, you might try to use the utf8.DecodeRuneInString(text[2:]) function. But this would only work with ASCII, as you are slicing at the third byte and asking the utf8 library to parse the first rune from that point. But if the first rune in the string is two bytes long, you'll be getting the second rune in the string, not the third. If it's three bytes long, you're really in trouble.

The safest way is to do what I did in the code: convert the string to a slice of runes ([]rune) immediately and then work it that. Now when you index runeslice[2] you know you are getting the third rune.


/* ---[ No do-while ]--- */

Go doesn't have a do-while loop. It doesn't even have a while statement: just for. But, as Rob Pike reminded me in a critique of the first version of this blog entry, a do-while can be adequately mimicked with an infinite for loop:

 func matchstar(c rune, regex []rune, text []rune) bool {
     for {
         if matchhere(regex, text) {
             return true
         }
         if ! (len(text) > 0 && (text[0] == c || c == '.')) {
             return false
         }
         text = text[1:]
     }
 }

The stated intent of Go was to be as minimal as possible. Pike, in a recent podcast interview, said that the core team that created Go (which includes Ken Thompson) all had to agree that a feature was essential for it be included. Many candidate features were dropped, including the do-while loop. Of note, goto was not, which I find quite interesting. goto is only mentioned once (almost in passing) in the Effective Go guide, so I'm interested in what the essential use case for it was.


/* ---[ One addition ]--- */

Finally, in the Beautiful Code chapter, Kernighan suggests a number of enhancements the reader can make. I've only done one - allowing the + (1 or more) operator by mildly precompiling the regex, turning x+ into xx*, allowing me to use Pike's original (ported) code untouched.

The above code is available on GitHub: https://github.com/midpeter444/pikeregex