Note

NOTE: The remainder of this document has no real structure; it is a parking lot of ideas and links that I will SOMEDAY come back to and organize.  Thanks, Lester Martin.

MapReduce Sharing Exercise(s)

For a public and/or Equifax audience, create a presentation that helps others "see" what MapReduce is all about.  This will be one, or both, of the following.

...

Best Practices for 3rd Party JARs

...

Investigate the MongoDB Connector for Hadoop, as called out at http://www.mongodb.com/press/integration-hadoop-and-mongodb-big-data%E2%80%99s-two-most-popular-technologies-gets-significant.

HadoOLTP

Pronounced Hadoo-L-T-P.  The intention is to re-layer LF2 on top of MapReduce (possibly using MongoDB for the leftovers of the OpenBatch.NET schema), which would eliminate an estimated 80+% of the codebase.

Speculative Execution Enhancement

Consider building out infrastructure that would allow a job submitter to provide a time-out value to use instead of the configurable "speculative execution" option.
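
To picture the idea, here is a minimal Python sketch (not Hadoop code) in which the submitter supplies a time-out and a single backup attempt is started only once the first attempt has exceeded it.  The names run_with_backup and timeout_s are hypothetical; for reference, the existing per-job switches are mapreduce.map.speculative and mapreduce.reduce.speculative.

```python
import concurrent.futures as cf
import time


def run_with_backup(task, timeout_s, pool):
    """Run task(); if it has not finished after timeout_s seconds,
    start one backup attempt and return whichever finishes first.

    Returns (result, attempts_started). The pool needs at least two workers.
    """
    first = pool.submit(task)
    done, _ = cf.wait([first], timeout=timeout_s)
    if done:
        return first.result(), 1
    # Submitter-provided time-out exceeded: launch a single backup attempt.
    backup = pool.submit(task)
    done, _ = cf.wait([first, backup], return_when=cf.FIRST_COMPLETED)
    # A real framework would also kill the losing attempt; this sketch does not.
    return next(iter(done)).result(), 2


if __name__ == "__main__":
    calls = []

    def sometimes_slow():
        calls.append(1)
        # Only the very first attempt is slow (a stand-in for a straggler).
        time.sleep(1.0 if len(calls) == 1 else 0.05)
        return "done"

    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        print(run_with_backup(sometimes_slow, timeout_s=0.2, pool=pool))
```

The design point is that the trigger is an explicit wall-clock value from the job submitter rather than the framework's own judgment about which tasks are lagging.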

Multithreaded Put

See if it would be possible to add a CLI argument to the hadoop fs -put command that would allow the file chunks to be written from the client to the cluster by multiple threads instead of sequentially.
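
A rough sketch of the client-side half of this, in Python and against the local filesystem only: split the file into fixed-size byte ranges and hand each range to a worker thread.  send_chunk is a hypothetical stand-in for whatever would actually write a chunk to the cluster; how the cluster side would accept chunks arriving out of order is left open here.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(file_size, chunk_size):
    """Return (offset, length) pairs covering file_size bytes."""
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]


def parallel_put(path, chunk_size, send_chunk, threads=4):
    """Read each chunk of `path` and pass it to send_chunk(index, data)
    from a pool of worker threads instead of one sequential loop.

    Returns the number of chunks sent.
    """
    ranges = chunk_ranges(os.path.getsize(path), chunk_size)

    def worker(item):
        index, (offset, length) = item
        with open(path, "rb") as f:  # each worker uses its own file handle
            f.seek(offset)
            send_chunk(index, f.read(length))

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(worker, enumerate(ranges)))
    return len(ranges)
```

The chunk index is passed along with the data so the receiving side can reassemble the chunks in order regardless of which thread finishes first.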

Block Sizes and Puts

The default behavior when copying an existing HDFS file to a new one is to lay down the copy with the configured block size, even if that differs from the block size the source file is using.  See what it would take to offer a switch indicating that the copy should keep the source file's block size when it differs from the default.
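
Until such a switch exists, something close can probably be approximated from the client: look up the source file's block size, then pass it as dfs.blocksize on the copy.  The Python sketch below only builds the two command lines and does not execute them; the function names are hypothetical, and it assumes the hadoop CLI is on the PATH if the commands are actually run.

```python
def stat_blocksize_cmd(src):
    """Command that prints the block size (in bytes) of an HDFS file."""
    return ["hadoop", "fs", "-stat", "%o", src]


def copy_keeping_blocksize_cmd(src, dst, src_block_size):
    """Command that copies src to dst, overriding the client's
    configured block size with the source file's block size."""
    if src_block_size <= 0:
        raise ValueError("block size must be positive")
    return ["hadoop", "fs", "-D", f"dfs.blocksize={src_block_size}",
            "-cp", src, dst]
```

Run for real, the first command's output would be parsed with int(...) and passed in as src_block_size; whether -cp honors the override on a given release is worth verifying before relying on it.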

Oozie's Hive Action Need for hive-site.xml in HDFS

Should see if there is a fix that can be submitted back.

Capacity Scheduler Time Considerations

It would seem a somewhat straightforward implementation effort to bring time-period-based configurations to the Capacity Scheduler, if the separation of concerns is implemented in the source code the way I'm imagining.  In core, the only real twist would be having to run a timer of sorts to know when it is time to refresh the configuration.  Additional effort would go into enhancing Ambari's upcoming View for the Capacity Scheduler.  Nonetheless, it sounds like an interesting feature.
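
The two pieces named above (picking the configuration for the current time period, and knowing when the timer should next fire) can be sketched in a few lines of Python.  This is not Capacity Scheduler code; the schedule format, the queue names, and the capacity percentages are all made up for illustration.

```python
from datetime import datetime, time, timedelta

# Hypothetical schedule, sorted by start time. Each period runs until the
# next entry's start; the last one wraps past midnight.
SCHEDULE = [
    (time(6, 0), {"batch": 30, "adhoc": 70}),   # daytime
    (time(20, 0), {"batch": 80, "adhoc": 20}),  # overnight
]


def active_config(now, schedule=SCHEDULE):
    """Return the queue capacities in effect at datetime `now`."""
    # Before the first start of the day, yesterday's last period still applies.
    current = schedule[-1][1]
    for start, config in schedule:
        if now.time() >= start:
            current = config
    return current


def seconds_until_refresh(now, schedule=SCHEDULE):
    """Seconds from `now` until the next period boundary."""
    candidates = []
    for start, _ in schedule:
        boundary = datetime.combine(now.date(), start)
        if boundary <= now:
            boundary += timedelta(days=1)  # already passed today; use tomorrow
        candidates.append(boundary)
    return (min(candidates) - now).total_seconds()
```

A timer loop would sleep for seconds_until_refresh(...), apply active_config(...), and then trigger the same kind of queue reload that yarn rmadmin -refreshQueues performs today.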