Saturday, November 22, 2014

Falling Walls Conference - Berlin 2014

Which are the next walls to fall?
November 9th, 2014 marked 25 years since the fall of the Berlin Wall. That day in 1989 was an epic day in the history of the world: it was not just about taking down a physical wall, but about shattering the barriers that existed between human beings at large.
Falling Walls Labs Berlin 2014
To commemorate this event, Berlin has hosted the famous Falling Walls Lab and Conference every year since their inception, bringing together the world's leading researchers whose work impacts science, technology and society to share it with the world. This year, I was accepted and participated in the Falling Walls Lab as one of 100 participants from around 40 countries. I received the A.T. Kearney scholarship, which covered most of the expenses, and I am thankful to them for that.

Presenting my talk at Falling Walls Labs 2014
I presented my talk on the topic "Breaking the Wall of Building Effective Communities", sharing my experience volunteering with Mozilla in India and Germany. The format was simple: three minutes per speaker, with a maximum of two slides each. It was a really different experience to distill the most essential points into two slides and explain them to the audience in just three minutes. The jury consisted of distinguished scientists and leaders from various fields of science, engineering and technology. It was a great experience to talk to so many talented individuals, listen to their work in areas ranging from automotive engineering to chemistry to biology, and share Mozilla's mission with them all. The event was live-streamed and drew a massive response on Twitter (#fallingwalls). Personally, it was a great event that helped me grow as an individual: the platform allowed me to promote Mozilla's mission to a global audience of intellectuals, and my experience of building and working with various communities reached people from many countries.

All participants of the Falling Walls Labs 2014
Here are my slides:


And last but not least, I had some time to explore Berlin and celebrate 25 years of the historic event with the locals. Here are some pictures:

Some Random Landscape of Berlin

River Spree
Berlin Academy of the Konrad-Adenauer-Stiftung (Falling Walls Labs Venue)

Neue Nationalgalerie (Falling Walls Conference Reception)

Radial System V (Venue for the Falling Walls Conference)
Ducks at River Spree

River Spree, Berlin

Me enjoying myself at the Conference Venue

Balloons formed a mock-up wall in the same place where the Berlin Wall once stood.
Later in the evening, the balloons were released into the night sky.

The tags attached to each balloon, commemorating 25 years of the event.

Saturday, June 21, 2014

Pail Framework - Thoughts

In the book Big Data (MEAP), Nathan Marz describes a framework called Pail (dfs-datastores) [1], a data-storage solution on top of Hadoop. It supports schemas, merges small files into larger chunks for better HDFS performance, and more.
Some points to note about Pail: 
  • Pail is a thin abstraction over files and folders from the dfs-datastores library.
  • Pail makes it significantly easier to manage a collection of records in a batch processing context.
  • The Pail abstraction frees us from having to think about file formats and greatly reduces the complexity of the storage code.
  • It enables you to vertically partition a dataset and provides a dead-simple API for common operations like appends, compression, and consolidation.
  • Pail is just a Java library and underneath it uses the standard file APIs provided by Hadoop.
  • Pail makes it easy to satisfy all of the requirements we have for storage on the batch layer.
  • We treat a pail like an unordered collection of records.
  • Internally, those records are stored across files that can be nested into arbitrarily deep sub-directories.
When using Pail, the underlying storage is still HDFS, so we can use standard Hadoop tools to access the files. Pail files are given globally unique names, so a pail can be written to concurrently by multiple writers without conflicts. Additionally, a reader can read from a pail while it is being written to, without having to worry about half-written files.
Typed Pails
When using Pail, we don't have to work with raw binary records; Pail lets us work with real objects. At the file level, data is stored as a sequence of bytes, so to work with real objects we provide Pail with information about what type our records will be and how to serialize and deserialize objects of that type to and from binary data.
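As a rough sketch of what providing that information might look like, here is a minimal pail structure for integers. It assumes the PailStructure interface from the dfs-datastores library (getType, serialize, deserialize, getTarget, isValidTarget); the exact package name and method signatures may differ between versions.

// A minimal sketch of a pail structure for integers (assumed interface:
// PailStructure from dfs-datastores; adjust the import to your library version,
// e.g. com.backtype.hadoop.pail.PailStructure).
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;

public class IntegerPailStructure implements PailStructure<Integer> {

    public Class getType() {
        // the record type this pail stores
        return Integer.class;
    }

    public byte[] serialize(Integer object) {
        // encode the int as 4 big-endian bytes
        return ByteBuffer.allocate(4).putInt(object).array();
    }

    public Integer deserialize(byte[] serialized) {
        return ByteBuffer.wrap(serialized).getInt();
    }

    public List<String> getTarget(Integer object) {
        // an empty list means no vertical partitioning; returning directory
        // names here would partition records into sub-directories
        return Collections.emptyList();
    }

    public boolean isValidTarget(String... dirs) {
        // accept records stored in any directory
        return true;
    }
}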
Ex: To create an integer pail:

Pail<Integer> intpail = Pail.create("/tmp/intpail", new IntegerPailStructure());
When writing records to the pail, we can give it Integer objects directly and Pail will handle the serialization, as shown in the following code:
TypedRecordOutputStream int_os = intpail.openWrite();
int_os.writeObject(1);
int_os.writeObject(2);
int_os.writeObject(3);
int_os.close();
Likewise, when we read from the pail, it will deserialize records for us. Here's how we can iterate through all the objects in the integer pail we just wrote to:
for (Integer record : intpail) {
    System.out.println(record);
}
Pail - Appends 
Using the append operation, we can add all the records from one pail into another pail. Here's an example of appending a pail called "source" into a pail called "target":
Pail source = new Pail("/tmp/source");
Pail target = new Pail("/tmp/target");
target.copyAppend(source);
The append operation is smart: it checks that it is valid to append the two pails together. For example, it won't let us append a pail that stores strings into a pail that stores integers.
There are three supported append operations (a short usage sketch follows this list):
  1. copyAppend
  2. moveAppend
  3. absorb
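For completeness, here is how each variant is invoked on the same source and target pails as above. The comments reflect my understanding of the differences between the variants and are worth verifying against the library source.

Pail source = new Pail("/tmp/source");
Pail target = new Pail("/tmp/target");
// pick one of the three variants:
target.copyAppend(source);      // copies the underlying files; the source pail stays intact
// target.moveAppend(source);   // moves the files instead; faster, but empties the source pail
// target.absorb(source);       // lets Pail choose the appropriate strategy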
Pail - Consolidate
Sometimes records end up being spread across lots of small files. This has a major performance cost associated with it when we want to process that data in a MapReduce job since MapReduce will need to launch a lot more tasks.
The solution is to combine those small files into larger files so that more data can be processed in a single task. Pail supports this directly by exposing a consolidate method. This method launches a MapReduce job that combines files into larger files in a scalable way.
To consolidate a pail we do: pail.consolidate();
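Putting it together, a minimal sketch (assuming an existing pail at the hypothetical path /tmp/target):

// open an existing pail and merge its many small files into larger ones;
// as noted above, this launches a MapReduce job under the hood
Pail pail = new Pail("/tmp/target");
pail.consolidate();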
Summary
PROS: 
It's important to be able to think about and manipulate data at the record level rather than the file level, and by abstracting away file formats and directory structure, the Pail abstraction lets us do exactly that. It frees us from having to think about the details of data storage while making it easy to do robust, enforced vertical partitioning as well as common operations like appends and consolidation. Without the Pail abstraction these basic tasks are manual and difficult; with it, vertical partitioning happens automatically and tasks like appends and consolidation are just one-liners. This means we can focus on how we want to process our records rather than on the details of how to store them.
CONS:
- Lack of active developer support.
- Not many developers working on or committing to the official Git repository.
- Not many open-source projects use it.
- Lack of documentation.
Additional References:
[1] https://github.com/nathanmarz/dfs-datastores
[2] http://misaxionsoftware.wordpress.com/2012/07/10/step-to-big-data-hello-pail/#more-190