Potential Open-Source Contributions

Just a "what if" page for me to keep track of my random thoughts on where I might be able to make some contributions back to open-source projects.

List HDFS Quotas

Create a CLI operation that asks HDFS for a list of all quotas (with or without usage), as the need is described in https://community.hortonworks.com/questions/115263/can-you-list-all-of-hdfs-quotas-with-a-single-comm.html.
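
Until such an operation exists, one workaround is to run the existing "hdfs dfs -count -q" command against a set of candidate directories and filter its output. The following is a minimal sketch of that filtering step only, assuming the usual eight-column output (QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME) without the -h flag. The names QuotaRow, parse_count_q and with_quota are made up for illustration and are not part of Hadoop.

```python
from typing import List, NamedTuple, Optional


class QuotaRow(NamedTuple):
    path: str
    name_quota: Optional[int]       # None when no name quota is set
    name_remaining: Optional[int]
    space_quota: Optional[int]      # bytes; None when no space quota is set
    space_remaining: Optional[int]
    dir_count: int
    file_count: int
    content_size: int


def _num(token: str) -> Optional[int]:
    # "-count -q" prints "none" or "inf" for directories without a quota.
    return None if token in ("none", "inf") else int(token)


def parse_count_q(output: str) -> List[QuotaRow]:
    """Parse the text printed by `hdfs dfs -count -q <paths>` (hypothetical helper)."""
    rows = []
    for line in output.splitlines():
        # The path is the last column and may itself contain spaces.
        parts = line.strip().split(None, 7)
        if len(parts) != 8 or not parts[4].isdigit():
            continue  # skip blank lines and the header printed by -v
        q, rq, sq, rsq, dirs, files, size, path = parts
        rows.append(QuotaRow(path, _num(q), _num(rq), _num(sq), _num(rsq),
                             int(dirs), int(files), int(size)))
    return rows


def with_quota(rows: List[QuotaRow]) -> List[QuotaRow]:
    """Keep only the directories that have a name quota or a space quota."""
    return [r for r in rows if r.name_quota is not None or r.space_quota is not None]
```

The caller still has to supply the directories to check, which is exactly the gap the proposed operation would close: the NameNode knows every quota, but this approach only sees the paths it is pointed at.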

HadoOLTP

Pronounced Hadoo-L-T-P.  The intention is to move my previous database-driven batch framework (LF2 / OpenBatch.NET) to layer on top of MapReduce (possibly using HBase, or maybe Hive, for the leftovers of the OpenBatch.NET schema), which would eliminate an estimated 80+% of the codebase.  I'm not 100% sure where this project would live, and the more I think about it, the more I think it is a waste of time.  (sad)

Speculative Execution Enhancement

Consider building out infrastructure that would allow a job submitter to provide a time-out value to use instead of the configurable "speculative execution" option.

Multithreaded Put

See if it would be possible to add a CLI argument to the "hadoop fs -put" command that would allow the file chunks to be written from the client to the cluster by multiple threads instead of sequentially.
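
The client-side half of this idea can be sketched as splitting the local file into byte ranges and handing each range to a worker thread. This is only an illustration of the dispatch logic: plan_chunks, parallel_put and the upload_chunk callback are hypothetical names, and nothing here talks to HDFS. Because HDFS allows a single writer per file, a real implementation would probably have to write the ranges as separate part files and stitch them together afterwards (for example with the HDFS concat operation), which is an assumption of this sketch rather than a worked-out design.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def plan_chunks(file_size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover the file exactly once."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]


def parallel_put(local_path: str,
                 file_size: int,
                 chunk_size: int,
                 upload_chunk: Callable[[str, int, int], object],
                 threads: int = 4) -> list:
    """Upload every chunk on a thread pool and return the results in chunk order.

    upload_chunk(local_path, offset, length) is a hypothetical callback that
    would write one byte range to the cluster.
    """
    chunks = plan_chunks(file_size, chunk_size)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(upload_chunk, local_path, off, length)
                   for off, length in chunks]
        # result() re-raises any exception from a worker, so a failed chunk
        # fails the whole put instead of being silently dropped.
        return [f.result() for f in futures]
```

Aligning chunk_size with the target block size would let each thread write whole blocks, which seems like the natural unit of parallelism here.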

Block Sizes and Puts

The default behavior when copying an existing file on HDFS to a new one is to lay down the copy with the configured block size, even if that differs from the block size the source file is using.  See what it would take to offer a switch that indicates one wants to keep the source file's block size when it differs from the default.  Consider bringing that change into DistCp as well.
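
Without a dedicated switch, the same effect can likely be approximated today in two steps: read the source file's block size with "hdfs dfs -stat %o", then pass that value as dfs.blocksize on the copy command. The sketch below only builds the two command lines; the function names are hypothetical, and the assumption that the -D generic option is honored by -cp should be verified on the target Hadoop version.

```python
from typing import List


def stat_block_size_cmd(src: str) -> List[str]:
    # The %o format of -stat prints the file's block size in bytes.
    return ["hdfs", "dfs", "-stat", "%o", src]


def copy_preserving_block_size_cmd(src: str, dst: str, src_block_size: int) -> List[str]:
    """Build a copy command that overrides the client's default block size.

    src_block_size is the number printed by the command from stat_block_size_cmd.
    """
    if src_block_size <= 0:
        raise ValueError("block size must be a positive number of bytes")
    return ["hdfs", "dfs", "-D", f"dfs.blocksize={src_block_size}", "-cp", src, dst]
```

The two lists can be handed to subprocess.run in sequence. DistCp's preserve option (-pb) may also be worth a look as a reference for how the proposed switch could behave.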

Oozie's Hive Action Need for hive-site.xml in HDFS

See whether a fix can be submitted upstream that would simplify, or remove, the need to place hive-site.xml in HDFS for Oozie's Hive action.

Capacity Scheduler Time Considerations

It would seem a somewhat straightforward implementation effort to bring time-period-based configurations to the Capacity Scheduler, provided the separation of concerns in the source code is implemented the way I'm imagining.  In the core, the only real twist would be having to run a timer of sorts to know when it is time to refresh the configuration.  Additional effort would be needed to enhance Ambari's upcoming View for the Capacity Scheduler.  Nonetheless, it sounds like an interesting feature.
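
The timer logic mentioned above can be sketched independently of the scheduler itself: given a daily schedule of configuration profiles, work out which profile is active now and how long to sleep before the next switch. Everything here is hypothetical (the schedule format, the profile names, the function names); the only existing piece it would lean on is a queue refresh such as "yarn rmadmin -refreshQueues" once the new configuration has been put in place.

```python
from datetime import datetime, time, timedelta
from typing import List, Tuple

# Hypothetical schedule format: (time of day the profile starts, profile name).
Schedule = List[Tuple[time, str]]


def active_profile(schedule: Schedule, now: datetime) -> str:
    """Return the profile whose start time was most recently passed."""
    if not schedule:
        raise ValueError("schedule must not be empty")
    ordered = sorted(schedule)
    # Before the first start time of the day, yesterday's last profile applies.
    current = ordered[-1][1]
    for start, name in ordered:
        if start <= now.time():
            current = name
    return current


def seconds_until_next_switch(schedule: Schedule, now: datetime) -> float:
    """Return how long a timer should wait before the next profile change."""
    if not schedule:
        raise ValueError("schedule must not be empty")
    ordered = sorted(schedule)
    for start, _ in ordered:
        candidate = datetime.combine(now.date(), start)
        if candidate > now:
            return (candidate - now).total_seconds()
    # No more switches today, so wait for the first one tomorrow.
    first_tomorrow = datetime.combine(now.date() + timedelta(days=1), ordered[0][0])
    return (first_tomorrow - now).total_seconds()
```

A loop that sleeps for seconds_until_next_switch, swaps in the configuration for active_profile, and then triggers the refresh would be enough for a prototype outside the ResourceManager.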

Enhance sqoop-merge to work with ORC files

http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_literal_sqoop_merge_literal accounts for text, sequence file and Avro formats, but doesn't address ORC.  Wondering how much effort it would take to add ORC support.

Migrate Hive TestBench to use Beeline

The current Hive TestBench project, https://github.com/hortonworks/hive-testbench, leverages the legacy Hive CLI tool, which is disabled on many clusters because of security concerns.
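
Much of the migration would come down to swapping each Hive CLI invocation for an equivalent Beeline one. As a rough sketch, assuming a script is run with an optional init file and a query file, the Beeline command line could be assembled as below. The function name and its parameters are hypothetical; -u, -n, -i and -f are standard Beeline options, and the target database goes into the JDBC URL.

```python
from typing import List, Optional


def beeline_command(jdbc_url: str,
                    query_file: str,
                    init_file: Optional[str] = None,
                    user: Optional[str] = None) -> List[str]:
    """Build a Beeline argument list that runs one query file (hypothetical helper)."""
    cmd = ["beeline", "-u", jdbc_url]
    if user:
        cmd += ["-n", user]        # user name to connect as
    if init_file:
        cmd += ["-i", init_file]   # script to run before the query file
    cmd += ["-f", query_file]      # script to execute
    return cmd
```

For example, beeline_command("jdbc:hive2://localhost:10000/mydb", "query1.sql", init_file="settings.sql") yields an argument list suitable for subprocess.run; the host, port, database and file names in that call are placeholders.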