Prioritized upcoming blog posts

  1. Capture some/most/all of the examples from my upcoming DevNexus preso; Transformation Processing Smackdown; Spark vs Hive vs Pig
  2. Create Spark RDD & DataFrame incantations of the "calculate salary statistics w/" postings associated with Open Georgia Analysis

Non-prioritized ideas for upcoming blog posts

  • Grow my Hive Performance Workshop
    • Blog & YouTube existing learnings (inc. plug for Streever's data gen tool)
    • Intro State & Dept tables and show map-side joins w/TSV+MR as well as ORC+Tez (B&YT!)
    • B&YT using blueprints to stand up 2nd cluster (then add spark CLI)
    • B&YT Hive import/export on TSV customer table and DistCp based movement coupled w/DDL recreation on ORC-based Customer table
    • B&YT bucket joins (after moving all ORC tables)
    • B&YT the Spark SQL compare/contrasts on all the various data formats, compaction levels and partitions to see how it all adapts
  • Storm reliable operations
  • Fix Streever's Kafka generator
  • Build out exploring Kafka Streams on YARN
  • What are "good" TeraSort numbers (and are they "good for anything"?)
  • How to use (and review) YARN "distributed shell" app
  • Fix MOYA (not Slider) bug for Hadoop 2.6
  • Sample Slider application
  • Exploring Apache Hive's SQL authorization (grants, roles & other fun stuff)
  • How does Apache Ranger handle Hive roles? (it doesn't!)
  • Hive's export/import operations (note to self; tracking in O.F.)
  • Test drive of Hive 0.14's CRUD operations (note to self; tracking in O.F.)
  • Local DataNode disk balancing options (note to self; tracking in O.F.)
  • Typical data ingestion workflow (note to self; tracking in O.F.)
    • Sqoop'ing some data
    • Transformation/enrichment with Pig
    • Accessing it from Hive
    • Pulling it all together with Oozie
    • Best practices of location/naming/structure of code & config for all components of the workflow
    • Maybe a redo of this workflow using Cascading?
  • Recap of Summit preso if only to provide links to deck and recording; loosely based on http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
  • Using snapshots with archiving solution
  • Support Hadoop vendors like you'd support PBS
  • Change hostnames & IPs of all hosts in an HDP cluster
  • HBase via JDBC (using Phoenix)
  • HBase via JDBC (part deux; just using Hive)
  • Pig schema reuse (and why it's not really good)
  • Connecting to SparkSQL, possibly as suggested here.
  • Managing Kafka offsets automagically
  • Playing with HBase versioning (include deleting a range of cells)

If you have some things you'd like to see, please share them in the comments.

  1. Hive insert-only files
  2. Bucketing w/ACID
  3. Spark submission w/Livy
  4. DbVisualizer w/Hive & Spark
  5. Solr Open Georgia Analysis

Other things to Learn & Blog