Friday, July 5, 2013

Querying JSON records via Hive


/* ---[ Opacity: A brief rant ]--- */

Despite the popularity of Hadoop and its ecosystem, I've found that much of it is frustratingly underdocumented or at best opaquely documented. An example proof of this is the O'Reilly Programming Hive book, whose authors say they wrote it because so much of Hive is poorly documented and exists only in the heads of its developer community.

But even the Programming Hive book lacks good information on how to effectively use Hive with JSON records, so I'm cataloging my findings here.


/* ---[ JSON and Hive: What I've found ]--- */

I've only been playing with Hive about two weeks now, but here's what I found with respect to using complex JSON documents with Hive.

Hive has two built-in functions, get_json_object and json_tuple, for dealing with JSON. There are also a couple of JSON SerDe's (Serializer/Deserializers) for Hive. I like this one the best: https://github.com/rcongiu/Hive-JSON-Serde

I will document using these three options here.

Let's start with a simple JSON document and then move to a complex document with nested subdocuments and arrays of subdocuments.

Here's the first document:

{
    "Foo": "ABC",
    "Bar": "20090101100000",
    "Quux": {
        "QuuxId": 1234,
        "QuuxName": "Sam"
    }
}

We are going to store this as a Text document, so it is best to have the whole JSON entry on a single line in the text file you point the Hive table to.

Here it is on one line for easy copy and pasting:

{"Foo":"ABC","Bar":"20090101100000","Quux":{"QuuxId":1234,"QuuxName":"Sam"}}

Let's create a Hive table to reference this. I've put the above document in a file called simple.json:

CREATE TABLE json_table ( json string );

LOAD DATA LOCAL INPATH  '/tmp/simple.json' INTO TABLE json_table;

Since there are no delimiters, we leave off the ROW FORMAT section of the table DDL


Built in function #1: get_json_object

The get_json_object takes two arguments: tablename.fieldname and the JSON field to parse, where '$' represents the root of the document.

select get_json_object(json_table.json, '$') from json_table; 

Returns the full JSON document.

So do this to query all the fields:

select get_json_object(json_table.json, '$.Foo') as foo, 
       get_json_object(json_table.json, '$.Bar') as bar,
       get_json_object(json_table.json, '$.Quux.QuuxId') as qid,
       get_json_object(json_table.json, '$.Quux.QuuxName') as qname
from json_table;

You should get the output:

foo    bar              qid     qname
ABC    20090101100000   1234    Sam

(Note: to get the header fields, enter set hive.cli.print.header=true at the hive prompt or in your $HOME/.hiverc file.)

This works and has a nice JavaScript like "dotted" notation, but notice that you have to parse the same document once for every field you want to pull out of your JSON document, so it is rather inefficient.

The Hive wiki recommends using json_tuple for this reason.


Built in function #2: json_tuple

So let's see what json_tuple looks like. It has the benefit of being able to pass in multiple fields, but it only works to a single level deep. You also need to use Hive's slightly odd LATERAL VIEW notation:

select v.foo, v.bar, v.quux, v.qid 
from json_table jt
     LATERAL VIEW json_tuple(jt.json, 'Foo', 'Bar', 'Quux', 'Quux.QuuxId') v
     as foo, bar, quux, qid;

This returns:

foo  bar             quux                              qid
ABC  20090101100000  {"QuuxId":1234,"QuuxName":"Sam"}  NULL

It doesn't know how to look inside the Quux subdocument. And this is where json_tuple gets clunky fast - you have to create another lateral view for each subdocument you want to descend into:

select v1.foo, v1.bar, v2.qid, v2.qname 
from json_table jt
     LATERAL VIEW json_tuple(jt.json, 'Foo', 'Bar', 'Quux') v1
     as foo, bar, quux
     LATERAL VIEW json_tuple(v1.quux, 'QuuxId', 'QuuxName') v2
     as qid, qname;

This gives us the output we want:

foo  bar             qid   qname
ABC  20090101100000  1234  Sam

With a complicated highly nested JSON doc, json_tuple is also quite inefficient and clunky as hell. So let's turn to a custom SerDe to solve this problem.


The best option: rcongiu's Hive-JSON SerDe

A SerDe is a better choice than a json function (UDF) for at least two reasons:

  1. it only has to parse each JSON record once
  2. you can define the JSON schema in the Hive table schema, making it much easier to issue queries against.

I reviewed a couple of SerDe's and by far the best one I've found is rcongiu's Hive-JSON SerDe.

To get that SerDe, clone the project from GitHub and run mvn package. It creates a json-serde-1.1.6.jar in the target directory. If you have a place you like to put your jars for runtime referencing move it there.

Then tell Hive about it with:

ADD JAR /path/to/json-serde-1.1.6.jar;

You can do this either at the hive prompt or put it in your $HOME/.hiverc file.

Now let's define the Hive schema that this SerDe expects and load the simple.json doc:

CREATE TABLE json_serde (
  Foo string,
  Bar string,
  Quux struct<QuuxId:int, QuuxName:string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

LOAD DATA LOCAL INPATH '/tmp/simple.json' INTO TABLE json_serde;

With the openx JsonSerDe, you can define subdocuments as maps or structs. I prefer structs, as it allows you to use the convenient dotted-path notation (e.g., Quux.QuuxId) and you can match the case of the fields. With maps, all the keys you pass in have to be lowercase, even if you defined them as upper or mixed case in your JSON.

The query to match the above examples is beautifully simple:

SELECT Foo, Bar, Quux.QuuxId, Quux.QuuxName
FROM json_serde;

Result:

foo  bar             quuxid  quuxname
ABC  20090101100000  1234    Sam



And now let's do a more complex JSON document:

{
  "DocId": "ABC",
  "User": {
    "Id": 1234,
    "Username": "sam1234",
    "Name": "Sam",
    "ShippingAddress": {
      "Address1": "123 Main St.",
      "Address2": null,
      "City": "Durham",
      "State": "NC"
    },
    "Orders": [
      {
        "ItemId": 6789,
        "OrderDate": "11/11/2012"
      },
      {
        "ItemId": 4352,
        "OrderDate": "12/12/2012"
      }
    ]
  }
}

Collapsed version:

{"DocId":"ABC","User":{"Id":1234,"Username":"sam1234","Name":"Sam","ShippingAddress":{"Address1":"123 Main St.","Address2":"","City":"Durham","State":"NC"},"Orders":[{"ItemId":6789,"OrderDate":"11/11/2012"},{"ItemId":4352,"OrderDate":"12/12/2012"}]}}

Hive Schema:

CREATE TABLE complex_json (
  DocId string,
  User struct<Id:int,
              Username:string,
              Name: string,
              ShippingAddress:struct<Address1:string,
                                     Address2:string,
                                     City:string,
                                     State:string>,
              Orders:array<struct<ItemId:int,
                                  OrderDate:string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

Load the data:

    LOAD DATA LOCAL INPATH '/tmp/complex.json' 
    OVERWRITE INTO TABLE complex_json;

First let's query something from each document section. Since we know there are two orders in the orders array we can reference them both directly:

SELECT DocId, User.Id, User.ShippingAddress.City as city,
       User.Orders[0].ItemId as order0id,
       User.Orders[1].ItemId as order1id
FROM complex_json;

Result:

docid  id    city    order0id  order1id
ABC    1234  Durham  6789      4352

But what if we don't know how many orders there are and we want a list of all a user's order Ids? This will work:

SELECT DocId, User.Id, User.Orders.ItemId
FROM complex_json;

Result:

docid  id    itemid
ABC    1234  [6789,4352]

Oooh, it returns an array of ItemIds. Pretty cool. One of Hive's nice features.

Finally, does the openx JsonSerDe require me to define the whole schema? Or what if I have two JSON docs (say version 1 and version 2) where they differ in some fields? How constraining is this Hive schema definition?

Let's add two more JSON entries to our JSON document - the first has no orders; the second has a new "PostalCode" field in Shipping Address.

{
    "DocId": "ABC",
    "User": {
        "Id": 1235,
        "Username": "fred1235",
        "Name": "Fred",
        "ShippingAddress": {
            "Address1": "456 Main St.",
            "Address2": "",
            "City": "Durham",
            "State": "NC"
        }
    }
}

{
    "DocId": "ABC",
    "User": {
        "Id": 1236,
        "Username": "larry1234",
        "Name": "Larry",
        "ShippingAddress": {
            "Address1": "789 Main St.",
            "Address2": "",
            "City": "Durham",
            "State": "NC",
            "PostalCode": "27713"
        },
        "Orders": [
            {
                "ItemId": 1111,
                "OrderDate": "11/11/2012"
            },
            {
                "ItemId": 2222,
                "OrderDate": "12/12/2012"
            }
        ]
    }
}

Collapsed version:

{"DocId":"ABC","User":{"Id":1235,"Username":"fred1235","Name":"Fred","ShippingAddress":{"Address1":"456 Main St.","Address2":"","City":"Durham","State":"NC"}}}
{"DocId":"ABC","User":{"Id":1236,"Username":"larry1234","Name":"Larry","ShippingAddress":{"Address1":"789 Main St.","Address2":"","City":"Durham","State":"NC","PostalCode":"27713"},"Orders":[{"ItemId":1111,"OrderDate":"11/11/2012"},{"ItemId":2222,"OrderDate":"12/12/2012"}]}}


Add those records to complex.json and reload the data into the complex_json table.

Now try the query:

SELECT DocId, User.Id, User.Orders.ItemId
FROM complex_json;

It works just fine and gives the result:

docid  id    itemid
ABC    1234  [6789,4352]
ABC    1235  null
ABC    1236  [1111,2222]

Any field not present will just return null, as Hive normally does even for non-JSON formats.

Note that we cannot query for User.ShippingAddress.PostalCode because we haven't put it on our Hive schema. You would have to revise the schema and then reissue the query.


/* ---[ A tool to automate creation of Hive JSON schemas ]--- */

One feature missing from the openx JSON SerDe is a tool to generate a schema from a JSON document. Creating a schema for a large complex, highly nested JSON document is quite tedious.

I've created a tool to automate this: https://github.com/midpeter444/hive-json-schema.