My landing page for "all things Hadoop", Big Data, and related technologies.  The content is rather unstructured right now, but I'll get there.  Take a look at David Streever's Hadoop space.

...

Filter by label (Content by label)

This macro lists blog posts labeled "hadoop", sorted by creation date in reverse (newest first).

Note

NOTE: The remainder of this document has no real meaningful structure; it is as much a parking lot of ideas and links that I will SOMEDAY come back to and apply some structure to.  Thanks, Lester Martin.

MapReduce Sharing Exercise(s)

For a public and/or Equifax audience, create a presentation that helps others "see" what MapReduce is all about.  This will be one, or both, of the following.

  • MapReduce Demystified - Yet another crash-course intro to MapReduce with a few simple examples such as Word Count or the weather example from the "definitive guide".  The goal here would be to help people wrap their heads around what's really happening.  UPDATE: Delivering this one internally on July 12, 2013.
  • MapReduce: Many Ways - Show a specific use case solved many ways.  This would include Hadoop MR as well as MongoDB's MR, not to mention its Aggregation Framework.  Should also show it with ecosystem tools like Hive/Pig as well as the classical SQL approaches.  Ideally, it would be cool to show it with Datameer as an "aha moment" for tools like this one.  Here's a Pivotal WC Many Ways article.
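
To help people "see" what's really happening in the Word Count example, here is a dependency-free sketch of the three conceptual phases in plain Java.  No Hadoop classes are involved; the method names (map, shuffle, reduce) are illustrative, not Hadoop's actual API.

```java
import java.util.*;

// Plain-Java sketch of the three MapReduce phases for Word Count.
// No Hadoop here -- just the conceptual map -> shuffle -> reduce flow.
public class WordCountPhases {

    // Map phase: each input line is turned into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all values by key, as the framework
    // would between the map and reduce tasks.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped counts for each word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("the quick brown fox", "the lazy dog")));
    }
}
```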

Best Practices for 3rd Party JARs

Figure out what the best practice is for shipping third-party JARs with a job.  Some notes at http://stackoverflow.com/questions/16825821/parsing-json-input-in-hadoop-java to get this topic going.
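
Until that best practice is nailed down, one common approach is Hadoop's -libjars generic option (the job's driver must run through ToolRunner/GenericOptionsParser for it to be honored).  The jar and class names below are placeholders:

```shell
# Ship a third-party jar with the job via the -libjars generic option.
# HADOOP_CLASSPATH makes the jar visible to the client JVM as well.
# (json-simple-1.1.jar and com.example.MyJsonJob are placeholder names.)
export HADOOP_CLASSPATH=/local/lib/json-simple-1.1.jar
hadoop jar myjob.jar com.example.MyJsonJob -libjars /local/lib/json-simple-1.1.jar input output
```

The main alternative is building a single "fat jar" (e.g., with the Maven Shade plugin) so nothing extra needs to be shipped at submit time.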

Generic Convert Uncompressed Text File to Snappy Encoded Sequence File

Based on thoughts from http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/ and http://stackoverflow.com/questions/5377118/how-to-convert-txt-file-to-hadoops-sequence-file-format, write a simple utility that converts text files to sequence files and compresses them with Snappy.  Or... am I overthinking this and is there a far easier way to do this?
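
A minimal sketch along the lines of those links, assuming a Hadoop 2.x client and Snappy native support on the classpath (so this will not compile standalone); key/value choices (line number as LongWritable, line text as Text) are just one reasonable convention:

```java
// Sketch only: requires the Hadoop client libraries on the classpath.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class TextToSnappySeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileSystem fs = FileSystem.get(conf);

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, new SnappyCodec()));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(in)))) {
            long lineNum = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                // Key = line number, value = line text; BLOCK compression
                // is the recommended mode for Snappy per the Cloudera post.
                writer.append(new LongWritable(lineNum++), new Text(line));
            }
        }
    }
}
```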

Hadoop in the Small

Of course... want, no NEED, to build a Hadoop cluster with Raspberry Pi devices as seen in the following URLs:

...

Maybe I could do it with Java on the BeagleBoard?  Or maybe just publish a very straightforward post like http://java.dzone.com/articles/getting-hadoop-and-running.

Bureau of Labor Statistics Example

How about a project using the BLS OES (Occupational Employment Statistics) datasets?
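
As a starting point, a toy aggregation over OES-style records; the column layout assumed here (occupation code, area, employment, mean annual wage, tab-separated) is for illustration only, not the exact OES file format:

```java
import java.util.*;

// Toy aggregation over BLS OES-style records.
// Assumed columns: occupation code, area, employment, mean annual wage
// (tab-separated) -- an illustrative layout, not the exact OES format.
public class OesRollup {

    // Sum total employment per occupation code.
    static Map<String, Long> employmentByOccupation(List<String> lines) {
        Map<String, Long> totals = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.split("\t");
            String occCode = f[0];
            long employment = Long.parseLong(f[2]);
            totals.merge(occCode, employment, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "15-1132\tGA\t12000\t96260",
                "15-1132\tTX\t30500\t94000",
                "29-1141\tGA\t81000\t68950");
        System.out.println(employmentByOccupation(sample));
    }
}
```

The same rollup is a natural fit for a one-line Pig GROUP BY or a simple Hive query once the data is on HDFS.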

Integration with MongoDB

Investigate MongoDB Connector for Hadoop as called out at http://www.mongodb.com/press/integration-hadoop-and-mongodb-big-data%E2%80%99s-two-most-popular-technologies-gets-significant.
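
For reference, the connector is driven by a couple of job properties; the database and collection names below are placeholders:

```
# Core mongo-hadoop connector settings (database/collection names are placeholders).
mongo.input.uri=mongodb://localhost:27017/demo.events
mongo.output.uri=mongodb://localhost:27017/demo.results
```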

HadoOLTP

Pronounced Hadoo-L-T-P.  The intention is to move LF2 to a layer on top of MapReduce (possibly using MongoDB for the leftovers of the OpenBatch.NET schema), which would eliminate an estimated 80+% of the codebase.

Speculative Execution Enhancement

Consider building out infrastructure that would allow a job submitter to provide a time-out value to use instead of the configurable "speculative execution" option.
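
For context, these are the existing all-or-nothing switches that a per-job time-out would refine (Hadoop 2 property names; they go in mapred-site.xml or the job configuration):

```xml
<!-- Existing speculative-execution switches (Hadoop 2 property names). -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>
```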

Multithreaded Put

See if it would be possible to add a CLI argument to the hadoop fs -put command that would allow the file chunks to be written from the client to the cluster in parallel instead of sequentially.
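
To get a rough feel for the client-side change, here is a sketch with a stubbed-out chunk writer; no HDFS is involved, and transferChunk() merely stands in for the real client-to-DataNode pipeline write:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of uploading a file's chunks concurrently instead of sequentially.
// transferChunk() is a stand-in for the real client-to-DataNode write.
public class ParallelPut {

    // Pretend to ship one chunk; returns the chunk index that was "written".
    static int transferChunk(byte[] data, int chunkIndex) {
        return chunkIndex;
    }

    // Split data into fixed-size chunks and "upload" them on a thread pool.
    static List<Integer> putParallel(byte[] data, int chunkSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            int numChunks = (data.length + chunkSize - 1) / chunkSize;
            for (int i = 0; i < numChunks; i++) {
                final int idx = i;
                int from = idx * chunkSize;
                final byte[] chunk = Arrays.copyOfRange(data, from,
                        Math.min(from + chunkSize, data.length));
                futures.add(pool.submit(() -> transferChunk(chunk, idx)));
            }
            List<Integer> written = new ArrayList<>();
            for (Future<Integer> f : futures) {
                try {
                    written.add(f.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return written;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(putParallel(new byte[1000], 256, 4));
    }
}
```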

Block Sizes and Puts

Default behavior when copying an existing file on HDFS to a new one is to lay down the copy with the configured block size, even if that differs from the block size the source file is using.  See what it would take to offer up a switch indicating that one wants to keep the source file's block size when it differs from the default.
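
The workaround today is setting the block size explicitly on the copy via a generic option; the value and paths below are placeholders (the property is dfs.blocksize in Hadoop 2, dfs.block.size in Hadoop 1):

```shell
# Force a specific block size on the copy (134217728 = 128 MB).
# Paths and the value are placeholders.
hadoop fs -D dfs.blocksize=134217728 -cp /data/src.dat /data/copy.dat
```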

Oozie's Hive Action Need for hive-site.xml in HDFS

Oozie's Hive action currently requires hive-site.xml to be staged in HDFS.  Should see if there is a fix that can be submitted back.
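
In the meantime, the workaround is to copy hive-site.xml into HDFS and reference it from the action via the job-xml element; the paths and schema version below are placeholders:

```xml
<!-- Hive action referencing a hive-site.xml previously copied into HDFS.
     Paths and the action schema version are placeholders. -->
<action name="run-hive">
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <job-xml>/user/oozie/conf/hive-site.xml</job-xml>
    <script>my_script.q</script>
  </hive>
  <ok to="end"/>
  <error to="fail"/>
</action>
```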