Upcoming Blog Posts

Prioritized upcoming blog posts

Capture some/most/all of the examples from my upcoming DevNexus preso, "Transformation Processing Smackdown; Spark vs Hive vs Pig"
Create Spark RDD & DataFrame incantations of the "calculate salary statistics w/" postings associated with Open Georgia Analysis

Non-prioritized ideas for upcoming blog posts

Grow my Hive Performance Workshop
- Blog & YouTube existing learnings (inc. plug for Streever's data gen tool)
- Intro State & Dept tables and show map-side joins w/TSV+MR as well as ORC+Tez (B&YT!)
- B&YT using blueprints to stand up 2nd cluster (then add Spark CLI)
- B&YT Hive import/export on TSV customer table and DistCp-based movement coupled w/DDL recreation on ORC-based customer table
- B&YT bucket joins (after moving all tables to ORC)
- B&YT the Spark SQL compare/contrasts on all the various data formats, compaction levels and partitions to see how it all adapts
Storm reliable operations
Fix Streever's Kafka generator
Build out exploring Kafka Streams on YARN
What are "good" TeraSort numbers (and are they "good for anything"?)
How to use (and review) YARN "distributed shell" app
Fix MOYA (not Slider) bug for Hadoop 2.6
Sample Slider application
Exploring Apache Hive's SQL authorization (grants, roles & other fun stuff)
How does Apache Ranger handle Hive roles? (it doesn't!)
Hive's export/import operations (note to self; tracking in O.F.)
Test drive of Hive 14's CRUD operations (note to self; tracking in O.F.)
Local DataNode disk balancing options (note to self; tracking in O.F.)
Typical data ingestion workflow (note to self; tracking in O.F.)
- Sqoop'ing some data
- Transformation/enrichment with Pig
- Accessing it from Hive
- Pulling it all together with Oozie
- Best practices of location/naming/structure of code & config for all components of the workflow
- Maybe a redo of this workflow using Cascading?
Recap of Summit preso if only to provide links to deck and recording; loosely based on http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
Using snapshots with archiving solution
Support Hadoop vendors like you'd support PBS
Change hostnames & IPs of all hosts in an HDP cluster
HBase via JDBC (using Phoenix)
HBase via JDBC (part deux; just using Hive)
Pig schema reuse (and why it's not really good)
Connecting to SparkSQL, possibly as suggested here.
Managing Kafka offsets automagically
Playing with HBase versioning (include deleting a range of cells)

If you have some things you'd like to see, please share them in the comments.

If comments are unavailable below, please see red notes in left-hand nav.