Saturday, June 21, 2014

Pail Framework - Thoughts

In the book Big Data (MEAP), Nathan Marz describes a framework called Pail (dfs-datastores) [1], a data storage solution on top of Hadoop. Among other things, it supports schemas and merging small files into larger chunks for better HDFS performance.
Some points to note about Pail: 
  • Pail is a thin abstraction over files and folders from the dfs-datastores library.
  • Pail makes it significantly easier to manage a collection of records in a batch processing context.
  • The Pail abstraction frees us from having to think about file formats and greatly reduces the complexity of the storage code.
  • It enables you to vertically partition a dataset and it provides a dead-simple API for common operations like appends, compression, and consolidation.
  • Pail is just a Java library and underneath it uses the standard file APIs provided by Hadoop.
  • Pail makes it easy to satisfy all of the requirements we have for storage on the batch layer.
  • We treat a pail as an unordered collection of records.
  • Internally, those records are stored across files that can be nested into arbitrarily deep sub-directories.
Because a pail is just files and folders on HDFS, we can use standard Hadoop tools to access its data. Pail files are given globally unique names, so a pail can be written to concurrently by multiple writers without conflicts. Additionally, a reader can read from a pail while it's being written to without having to worry about half-written files.
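For intuition, a pail on HDFS is simply a directory tree. The layout below is illustrative; the exact metadata file name and data file naming scheme may vary across dfs-datastores versions:

/tmp/mypail/
    pail.meta                       <- pail metadata (record type, format)
    c49f2e7d-...-f3.pailfile        <- data file with a globally unique name
    1/
        8a0b3c1d-...-2e.pailfile    <- data file inside a vertical partition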
Typed Pails
We don’t have to work with binary records when using Pail. Pail lets us work with real objects rather than binary records. At the file level, data is stored as a sequence of bytes. To work with real objects, we provide Pail with information about what type our records will be and how to serialize and deserialize objects of that type to and from binary data.
For example, to create an integer pail:

Pail<Integer> intpail = Pail.create("/tmp/intpail", new IntegerPailStructure());
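The IntegerPailStructure passed to Pail.create is what tells Pail how to map integers to and from bytes. Below is a minimal sketch of what such a structure might look like, assuming the PailStructure interface from dfs-datastores (getType, serialize, deserialize, getTarget, isValidTarget); the 4-byte big-endian encoding and the class body are illustrative, not necessarily what the library ships with:

import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;
// Package name as in recent dfs-datastores; older versions used backtype.hadoop.pail.
import com.backtype.hadoop.pail.PailStructure;

public class IntegerPailStructure implements PailStructure<Integer> {
    public Class getType() {
        return Integer.class;
    }

    public byte[] serialize(Integer record) {
        // Illustrative fixed-width encoding: 4 bytes, big-endian.
        return ByteBuffer.allocate(4).putInt(record).array();
    }

    public Integer deserialize(byte[] serialized) {
        return ByteBuffer.wrap(serialized).getInt();
    }

    public List<String> getTarget(Integer record) {
        // An empty path means no vertical partitioning; returning a
        // sub-directory name here would shard records on disk instead.
        return Collections.emptyList();
    }

    public boolean isValidTarget(String... dirs) {
        // With no partitioning, records may live anywhere in the pail.
        return true;
    }
}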
When writing records to the pail, we can give it integer objects directly and the Pail will handle the serialization, as shown in the following code:
TypedRecordOutputStream int_os = intpail.openWrite();
int_os.writeObject(1);
int_os.writeObject(2);
int_os.writeObject(3);
int_os.close();
Likewise, when we read from the pail, the pail will deserialize records for us. Here's how we can iterate through all the objects in the integer pail we just wrote to:
for (Integer record : intpail) {
    System.out.println(record);
}
Pail - Appends 
Using the append operation, we can add all the records from one pail into another pail. Here's an example of appending a pail called "source" into a pail called "target":
Pail source = new Pail("/tmp/source");
Pail target = new Pail("/tmp/target");
target.copyAppend(source);
The append operation is smart: it first checks that it's valid to append the two pails together. For example, it won't let us append a pail that stores strings into a pail that stores integers.
There are three types of supported append operations; a brief usage sketch follows the list:
  1. copyAppend
  2. moveAppend
  3. absorb
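All three share the same call pattern as copyAppend above. The sketch below reuses the source and target pails from the previous snippet; the comments reflect the copy-versus-move semantics suggested by the method names and are my reading, not official documentation:

target.copyAppend(source); // copies the underlying files; the source pail is left intact
target.moveAppend(source); // moves the files instead of copying, faster but empties the source
target.absorb(source);     // lets Pail pick the most efficient append available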
Pail - Consolidate
Sometimes records end up spread across lots of small files. This carries a major performance cost when we want to process that data in a MapReduce job, since MapReduce will need to launch many more tasks.
The solution is to combine those small files into larger files so that more data can be processed in a single task. Pail supports this directly by exposing a consolidate method. This method launches a MapReduce job that combines files into larger files in a scalable way.
To consolidate a pail, we simply call: pail.consolidate();
Summary
PROS: 
It's important to be able to think about and manipulate data at the record level rather than the file level. By abstracting away file formats and directory structure into the Pail abstraction, we're able to do exactly that. Pail frees us from having to think about the details of data storage while making it easy to do robust, enforced vertical partitioning as well as common operations like appends and consolidation. Without Pail, these basic tasks are manual and difficult; with it, vertical partitioning happens automatically, and tasks like appends and consolidation are one-liners. This means we can focus on how we want to process our records rather than on the details of how to store them.
CONS:
- Lack of active developer support.
- Not many developers working on or committing to the official Git repository.
- Not many open-source projects using it.
- Lack of documentation.
Additional References:
[1] https://github.com/nathanmarz/dfs-datastores
[2] http://misaxionsoftware.wordpress.com/2012/07/10/step-to-big-data-hello-pail/#more-190

Yes! I Am Alive - Reporting from Munich, Germany

It's been 2 years, 2 months and 13 days since my last post on this blog. That's quite a long time of inactivity, and quite a lot to catch up on in one post, so I will slowly bring out my experiences over time. For now, I just want to state that I am active again, this time reporting from Munich, Germany. It's such an amazing place. I currently live in a student Wohnheim (dormitory) in Oberschleißheim.
My student Wohnheim
I am now a Master's student at the Technical University of Munich. To make it easier for people to catch up, I have updated the About Me page, which has all the current details of my work. I make a special effort to keep my LinkedIn profile updated, so that can be a good starting point as well.

A distinctive feature of my university is the amazing slide at my faculty building (Mathematik - Informatik). Below is a picture I took just before the Christmas holidays in 2013.
Slides and Christmas Tree at TU Muenchen.
I see that my last post was the one where I shared my class valedictorian speech, delivered in 2012. Well, a lot has happened between that day and today. In short: I got a research publication in a top-tier NLP conference, met amazing people at IIT Bombay and TU Munich, traveled to Germany, the US, and Austria, started learning German [Ja, ich spreche ein bisschen Deutsch. :)], and started cooking on my own. Really, I am not joking; take a look:
Arhar and Malka Dal with Rice, Cucumber Raita, and Salad
I will be more active, with more techy and nerdy posts, as I have fresh energy this time to drive me through. I will focus on sharing experiences gained working in the domains of Natural Language Processing, Machine Learning, and Distributed Systems. So stay tuned for more stuff!

Danke! Tschüss.