Hive queries on complex JSON

Lately, I’ve been immersed in some Big Data projects. One of the issues I had to solve was related to extracting logs from JSON using a Hive query. You can find many resources on the web showing how to deal with JSON in a Hive query, but there aren’t many good examples for very custom data. As a starting point, if you want to handle JSON you can use one of the SerDe packages around that will help you deal with it (for example, the hive-json-serde project you can find on Google Code). This SerDe is pretty good at parsing simple JSON objects, but what if you need to deal with arrays, maps or nested complex structures?
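For instance, a flat record like the one below, with a single JSON object per line and only primitive values, is exactly what such a SerDe handles well (the sample is mine, reusing the fields that appear later in this post):

{ "blogID" : "FJY26J1333", "data" : "20120401", "name" : "vpxnksu" }

But as soon as a field holds an array, a map or another nested object, you need something more.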

Then you can continue your research and find another cool project that can parse arrays and nested structures within JSON objects, and that also works with arrays/maps from the Java world when you need to serialize your data. You can take a look at this project: rcongiu/Hive-JSON-Serde.

This project is great for dealing with nested structures and arrays within JSON objects, as shown below:

{ "country" : "Switzerland", "languages" : ["German", "French", "Italian"], "religions" : { "catholic" : [10,20], "protestant" : [40,50] } }

but rows that are themselves JSON arrays are also supported – for example:

[{ "blogID" : "FJY26J1333", "data" : "20120401", "name" : "vpxnksu" }]
[{ "blogID" : "VSAUMDFXFD", "data" : "20120401", "name" : "yhftrcx" }]

As you’ve seen, this one solves a lot of the issues you may run into, but what if your data is formatted in a way that it can’t handle?

Let’s suppose, as in my case, that you want to deal with a JSON array at the root level containing N JSON objects. How can you deal with that?

Consider the example below:

[{ "blogID" : "FJY26J1333", "data" : "20120401", "name" : "vpxnksu" }, { "blogID" : "VSAUMDFXFD", "data" : "20120401", "name" : "yhftrcx" }]

As you dive deep into the rcongiu/Hive-JSON-Serde code you see that for each record received as input, you get one record as output: a Java object that Hive can manipulate. With the example above as input, this leads to getting a single row holding only the first object ({ "blogID" : "FJY26J1333", "data" : "20120401", "name" : "vpxnksu" }). Why is this happening? Because the default file format mapper is used in this case, and it interprets each line as one input row for the SerDe within the mapper stage. What can you do to work around this issue? Well, one simple answer would be: implement your own file format mapper!

This is the point where you need to write some code to deal with your custom format. If you look at the CREATE TABLE statement definition, you can see that it supports a file_format as well as a row_format (the latter is what you have been using to deal with JSON, by linking it to the SerDe package of your choice). So, what do you have to implement to provide classes for the INPUTFORMAT and OUTPUTFORMAT parts of the file_format clause?

Well, you will need to implement logic for the following:

- an InputFormat<K,V>, which tells Hive how to read records out of your files, and
- an OutputFormat<K,V>, which tells it how to write results back out.

Fortunately, you have an abstract base class for each one of them that you can use as a starting point: FileInputFormat<K,V> on the input side and FileOutputFormat<K,V> on the output side.

In the rest of this post I’ll show you how you can easily implement the InputFormat<K,V> interface to deal with this custom data, letting you define and execute SELECT queries over it.

So, for the InputFormat interface I inherited from the FileInputFormat base class and called my custom format ‘JsonInputFormat’, as you can see below:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobConfigurable;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.json.JSONException; // or the JSONException bundled with your SerDe

public class JsonInputFormat extends FileInputFormat<LongWritable, Text>
        implements JobConfigurable {

    @Override
    public void configure(JobConf conf) {
        // No job-specific configuration is needed for this format.
    }

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        // Each file holds a single root-level JSON array, so it must be read as one
        // unit; letting the framework split it would break (or duplicate) the array.
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit,
            JobConf job, Reporter reporter) throws IOException {

        reporter.setStatus(genericSplit.toString());

        try {
            return new JsonInputRecordReader(job, genericSplit);
        } catch (JSONException e) {
            throw new IOException("Cannot parse the JSON array in split " + genericSplit, e);
        }
    }
}

The main purpose of this class is to create and provide an instance of org.apache.hadoop.mapred.RecordReader<K,V>, which will be the interaction point with Hive. For this interface I implemented a very simple class that loads the file referenced by the input split metadata and parses it using the parsing capabilities of the rcongiu/Hive-JSON-Serde package (this is not strictly necessary, as you could use any JSON parser you prefer and return the JSON objects as text, but since I had been working with that package already I used it here as well).

You can see this class implementation below:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.RecordReader;
// JSON classes: the post uses the parser bundled with Hive-JSON-Serde; org.json exposes the same API.
import org.json.JSONArray;
import org.json.JSONException;

public class JsonInputRecordReader implements RecordReader<LongWritable, Text> {

    private JSONArray jsonArray;
    private int pos = 0;

    public JsonInputRecordReader(Configuration job, InputSplit split)
            throws IOException, JSONException {
        if (!(split instanceof FileSplit)) {
            throw new IOException("Expected a FileSplit, got " + split.getClass().getName());
        }

        // Read the whole file referenced by the split and parse it as a JSON array.
        final Path file = ((FileSplit) split).getPath();
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(file);
        StringBuilder content = new StringBuilder();
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(fileIn, "UTF-8"));
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line);
            }
        } finally {
            // Close the stream but not the FileSystem: the latter is a shared, cached instance.
            fileIn.close();
        }
        this.jsonArray = new JSONArray(content.toString());
    }

    @Override
    public void close() throws IOException {
        // Nothing to release: the input stream was already closed in the constructor.
    }

    @Override
    public LongWritable createKey() {
        return new LongWritable();
    }

    @Override
    public Text createValue() {
        return new Text();
    }

    @Override
    public long getPos() throws IOException {
        return this.pos;
    }

    @Override
    public float getProgress() throws IOException {
        if (this.jsonArray.length() == 0) {
            return 1.0f;
        }
        return (float) this.pos / this.jsonArray.length();
    }

    @Override
    public boolean next(LongWritable key, Text value) throws IOException {
        if (this.pos >= this.jsonArray.length()) {
            return false;
        }

        key.set(this.pos);
        try {
            // Emit the pos-th element of the root array as plain text; the SerDe
            // configured in the ROW FORMAT clause turns it into columns.
            value.set(this.jsonArray.getJSONObject(this.pos).toString());
        } catch (JSONException e) {
            // Surface the problem as a row the SerDe can still read.
            value.set("{ \"error\": \"" + e.getMessage() + "\" }");
        }

        this.pos++;
        return true;
    }
}

The key points of this class are the constructor and the next() method. In the constructor, as you can see, I load the file that contains the JSON array and hand its content to the JSONArray constructor (this class comes with the SerDe library and does the actual parsing of the input). Internally, the instance maintains an index (the pos variable), and when Hive asks for the next() record I just return the element at that position of the JSONArray object created before. As you can also see, the return format is plain text, so there is no need for complex JSON parsing libraries at this point – just one that can handle the objects within an array and return them.
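If you want to exercise the reader outside of Hive, the snippet below is a minimal standalone sketch of mine (not code from the original post) that drives it against a local file containing a root-level JSON array; the file path and the demo class name are just placeholders:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

public class JsonInputRecordReaderDemo {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        // Hypothetical sample file holding something like [{...}, {...}, ...].
        Path input = new Path("/tmp/blogs.json");

        FileSystem fs = FileSystem.getLocal(conf);
        long length = fs.getFileStatus(input).getLen();

        // A single split covering the whole file, as the non-splitable input format produces.
        FileSplit split = new FileSplit(input, 0, length, (String[]) null);

        RecordReader<LongWritable, Text> reader = new JsonInputRecordReader(conf, split);
        LongWritable key = reader.createKey();
        Text value = reader.createValue();

        // Each call to next() yields one element of the root-level JSON array.
        while (reader.next(key, value)) {
            System.out.println(key.get() + " -> " + value.toString());
        }
        reader.close();
    }
}

Each printed line corresponds to one row that Hive’s map tasks will hand to the SerDe.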

Having this in place you can define your table as follows:

CREATE EXTERNAL TABLE my_table(blogid STRING, data INT, name STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.mk.jsonserde.JsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/my_location'

Note: as you can see, I’m using a stock OUTPUTFORMAT implementation, since I don’t want to write results anywhere else. The one I picked, HiveIgnoreKeyTextOutputFormat, also discards the keys, which I don’t need either. With this table in place, a plain SELECT * FROM my_table returns one row per object of the root-level JSON array, which was the original goal.

I hope these tips help you walk your own path to using Hive against custom logs… I’ll let you investigate the implementation of the OutputFormat in case you want to write results to a specific place :)
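If you do go down that path, the skeleton below could be a starting point. It is my own illustration rather than code from the original post: the class name is invented, the writer simply mimics what HiveIgnoreKeyTextOutputFormat does (one line of text per row, keys dropped), and depending on your Hive version you may need to implement org.apache.hadoop.hive.ql.io.HiveOutputFormat instead of, or in addition to, the plain Hadoop interface:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

public class JsonOutputFormat extends FileOutputFormat<LongWritable, Text> {

    @Override
    public RecordWriter<LongWritable, Text> getRecordWriter(FileSystem ignored,
            JobConf job, String name, Progressable progress) throws IOException {

        // Write each row as one line of text into this task's output file.
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        FileSystem fs = file.getFileSystem(job);
        final FSDataOutputStream out = fs.create(file, progress);

        return new RecordWriter<LongWritable, Text>() {
            @Override
            public void write(LongWritable key, Text value) throws IOException {
                // Ignore the key, mirroring HiveIgnoreKeyTextOutputFormat.
                out.write(value.getBytes(), 0, value.getLength());
                out.write('\n');
            }

            @Override
            public void close(Reporter reporter) throws IOException {
                out.close();
            }
        };
    }
}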

