In 2020, I began blogging at http://lestermartin.blog as detailed in my last blog post below and the sister-article at https://lestermartin.wordpress.com/2020/04/08/moving-my-tech-blog-already-missing-confluence/.

I absolutely love using Confluence for my tech blog, but I’ve had a number of issues with people getting access to it. The two biggest problems were that the mobile “wrapper” links that sites like LI & FB produce could not render images, and that anyone who already uses Confluence (for work, in most cases) could not actually see my anonymous pages because the system expected me to grant them access. As the title says, I’m MISSING CONFLUENCE!!…
This is a quick blog post to show how minor and major compaction for Hive transactional tables occurs. Let’s use the situation that the hive acid transactions with partitions (a behind the scenes perspective) post leaves us in. Here it is!…
Ever since Hive Transactions https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions have surfaced, and especially since Apache Hive http://hive.apache.org/ 3 was released, I’ve been meaning to capture a behind-the-scenes look at the underlying delta ORC https://orc.apache.org/ files that are created; and yes, compacted. If you are new to Hive’s ACID transactions, then the first link in this post as well as the Understanding Hive ACID Transaction Table http://shzhangji.…
The Apache ORC https://orc.apache.org/ file format has been used heavily by Apache Hive http://hive.apache.org/ for many years now, but being a bit of a “binary file format” there just isn’t much we can do with basic tools to see the contents of these files as shown below. $ cat orcfile ORC P1 ???>P>??be!Q%~.×¢?d!?????T ?; DoeSmith(P4??be!%..&wG!?? ? ??'LesterEricJohnSusie FdEBR F6PDoeMartinSmithGATXOKMA??? ??]?M?Ku??????9?sT?#?ްͲ㖆O:^xh?>??FWe?Pve??æ¡¿F?Ó²?LuS????b?` `??`???/p?_?]C?8???kQf?…
This blog post introduces the three streaming frameworks that are bundled in the Hortonworks Data Platform (HDP) https://hortonworks.com/products/data-platforms/hdp/ – Apache Storm, Spark Streaming, and Kafka Streams – and focuses on the supervision features offered to the topologies (aka workflows) running with, or within, these particular frameworks. This post does not attempt to fully describe each framework nor does it provide examples of their usage via working code.…
Today is one of those days when I thought I knew something, stood firm with my assumption, and then found out that I wasn’t as right as I thought. Yes, a more humble person might even say they were wrong. But… I’m not totally wrong, but surely not totally right! Much like my discoveries in learning something new every day (seems hdfs is not as immutable as i thought) https://martin.atlassian.net/wiki/spaces/lestermartin/blog/2017/10/03/126681089, this time was also about Hadoop’s HDFS.…
As the title suggests, this posting was something I came up with AFTER I published the first three installments of my Open Georgia Analysis https://martin.atlassian.net/wiki/spaces/lestermartin/pages/19857431/Open+Georgia+Analysis way back in 2014. And yes, you might have also noticed I took a long break from blogging about Big Data technologies in 2018 and I’m hoping to change that for 2019. On the other hand, my personal blog https://martin.atlassian.…
Well… it looks like I have not published a single professional blog posting all year, so hopefully this will jumpstart my efforts. And, of course, I’m behind the eight ball on this one! It is the Saturday before the Monday night user group presentation I committed to a long time ago, https://www.meetup.com/Atlanta-Net-User-Group/events/244527107/, and I’m FINALLY getting around to fully testing this all out.…
Well... it seems the old adage about learning something new every day just kicked me in the pants.  I have always stood firm on the statement that HDFS files were immutable, but I told folks that you could "game the system" by appending to a file.  Of course, I continued to stand firm on the immutability line as that meant that the newly added data (on the end of the file) was just captured in additional blocks. Then today, the Big Data Bear https://www.linkedin.…
The Hortonworks Community Connection http://community.hortonworks.com/ (HCC) is a GREAT resource for folks helping each other out in the big data community with special focus on all things Hadoop (including components like HBase) and Spark.  To help encourage folks to continue to provide answers to questions, you should "Accept" an answer if there is one that best helps you out.  This is just a simple link at the bottom of an answer. When you do that,…
NOTE: This is a corner-case blog post and really only useful for those who find this entry from a very specific Google search!!  Additionally, I'm filing a support case and expect the problem to be resolved long before it becomes a big issue for many.  But... there are those of us on the latest-greatest version who get to find this stuff out.  ;-) The Problem (in a nutshell) Storm in HDP 2.5.3 brings in version 1.6.6 of org.slf4j:log4j-over-slf4j instead of the required version 1.7.21.…
So you get your ops guy to stand up a "secure" (aka Kerberos-enabled) Hadoop cluster and then you try to create a table in the shell.  Lo and behold, you then get slammed with an AccessDeniedException like the one shown below. [student2@ip-172-30-0-42 ~]$ kinit Password for student2@LAB.HORTONWORKS.NET: [student2@ip-172-30-0-42 ~]$ klist Ticket cache: FILE:/tmp/krb5cc_432201241 Default principal: student2@LAB.HORTONWORKS.…
There I was on an AWS hosted node trying to access ports 2181 and 9092 on another AWS node where I had just followed the instructions at http://kafka.apache.org/documentation/#quickstart to get a stand-alone instance of Kafka running.  After some exceptions that suggested I could not reach these ports, I fell back to trusty old telnet to verify that was the problem. [root@ip-172-xxx-xxx-86 kafka]# telnet kafka 22 Trying 172.xxx.xxx.45...…
This year's DevNexus https://devnexus.com conference was great and I was thrilled to be a speaker https://devnexus.com/s/speakers/17531.  They did not record the sessions and post them online, so feel free to see my dress rehearsal video https://www.youtube.com/watch?v=36_MayK5eU4 and download my presentation https://www.slideshare.net/lestermartin/transformation-processing-smackdown-spark-vs-hive-vs-pig if my compare/contrast of Pig,…
Those who listen to National Public Radio http://www.npr.org/ (NPR) probably hate pledge drive season.  That is probably because they fall into one of two camps.  The first, and biggest, are those who enjoy some of the programs broadcast, but cannot understand why they are being pestered to donate and just want the asking for funds to stop.  The second group, and more profound of the two, "get it" and are eager to help fund NPR.…
Being a man of a "certain age" (that was a good show https://en.wikipedia.org/wiki/Men_of_a_Certain_Age) and staring down another one of those birthdays that end in a 0, I was a little disturbed to read an email version of Marc Cenedella's The Y2K Bug... on your resume https://www.theladders.com/career-advice/y2k-bug-resume/ blog post.  What bothered me so much was Marc's clear comment that ageism is out there and his answer was Don’t list any dates on your resume before the Year 2000.…
A colleague of mine and I were having the proverbial "the grass is always greener on the other side of the fence" http://www.urbandictionary.com/define.php?term=The%20grass%20is%20always%20greener%20on%20the%20other%20side%20of%20the%20fence discussion while comparing job responsibilities/opportunities when I introduced him to The Suck Continuum™.  This is my tried and tested model of what happens when one changes a job.  It basically goes like this.…
The term "agile" has really been on my mind lately and the simple & novel write-up still holds true. http://agilemanifesto.org/ Manifesto for Agile Software Development We are uncovering better ways of developing software by doing it and helping others do it.…
My blogging has been drying up lately as I've mostly been focused on trying to add value within the Hortonworks Community Connection (HCC) forums where I ran into this question: https://community.hortonworks.com/questions/50243/pig-inner-join-with-different-keys.html.  This person was having trouble performing an inner join with Pig across four datasets.…
A common use case that can be easily addressed with Pig is to break an input file into separate files based on one of the record's attributes.  An easy thing to visualize would be breaking this up on a date as I will show in my quick example, but it could be any relevant attribute such as sales region or originating country.  So, let's start with a simple input file to process. 2016-01-01,field1-01a,field2-01a,field3-01a 2016-01-02,field1-02a,field2-02a,field3-02a 2016-01-03,field1-03a,…
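The idea of breaking records into separate outputs by one attribute can be sketched outside of Pig. What follows is a plain-Python stand-in, not the Pig script from the post; the function name `split_by_key` and the sample records are hypothetical, modeled loosely on the input format shown above.

```python
from collections import defaultdict


def split_by_key(lines, key_index=0, delimiter=","):
    """Group delimited records by the value found at key_index."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key = line.split(delimiter)[key_index]
        groups[key].append(line)
    return dict(groups)


# Hypothetical records in the same general shape as the sample input.
records = [
    "2016-01-01,a1,b1,c1",
    "2016-01-02,a2,b2,c2",
    "2016-01-01,a3,b3,c3",
]

for day, rows in sorted(split_by_key(records).items()):
    # In a real job each group would be written to its own file or directory.
    print(day, len(rows))
```

Passing a different `key_index` would split on another attribute, such as a sales region or originating country column.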
What propeller-head technologist doesn't like to get a shiny new toy?!?!  I know I sure do and for me, my next "gadget" was a shiny new Intel NUC to play with.  Specifically, I went with the NUC5CPYH http://www.intel.com/content/www/us/en/nuc/nuc-kit-nuc5cpyh.html model. 1a.JPG As suggested in the picture above, the NUC is a full system shy of a hard drive and some memory.  It even has wifi and bluetooth!  Here's how I spec'd mine out which came in around $215.…
I'm writing this blog post for those learning Spark's RDD programming model and who might have heard that using the mapPartitions() transformation is usually faster than its map() brethren and wondered why.  If in doubt... RTFM, which brought me to the following excerpt from http://spark.apache.org/docs/latest/programming-guide.html. map.png mapPartition.png The good news is that the answer is right there.…
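One commonly cited reason for the difference is that a function passed to mapPartitions() is invoked once per partition, so any costly setup can be paid once per partition rather than once per element. The sketch below is plain Python rather than Spark code, and it may not match the explanation the full post gives; `expensive_setup`, `map_style`, `map_partitions_style`, and the sample partitions are all hypothetical names used only for illustration.

```python
def expensive_setup(counter):
    """Stand-in for costly work, such as opening a connection."""
    counter["setups"] += 1
    return lambda x: x * 2


def map_style(partitions, counter):
    # The function runs once per element, so setup is repeated every time.
    return [[expensive_setup(counter)(x) for x in part] for part in partitions]


def map_partitions_style(partitions, counter):
    # The function receives a whole partition, so setup happens once per partition.
    result = []
    for part in partitions:
        double = expensive_setup(counter)
        result.append([double(x) for x in part])
    return result


partitions = [[1, 2, 3], [4, 5, 6]]  # hypothetical data in two partitions

per_element = {"setups": 0}
per_partition = {"setups": 0}
same = map_style(partitions, per_element) == map_partitions_style(partitions, per_partition)
print(same, per_element["setups"], per_partition["setups"])  # prints: True 6 2
```

Both styles produce the same values; only the number of setup calls differs, which is where the savings would come from when setup is expensive.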
These instructions have been tested on Rev 1.2 of the HDP Operations: Hadoop Administration I http://hortonworks.com/training/class/hdp-operations-hadoop-administration-fundamentals/ course. These steps are captured to support a client who wants to do the Installing HDP lab from Hortonworks' Admin I course, but use a non-root user for the install.  The current version of this course is based on HDP 2.3.0 and you can visit http://docs.hortonworks.com/HDPDocuments/Ambari-2.1.1.…
Hortonworks University has always offered our one-day self-paced HDP Overview: Apache Hadoop Essentials http://hortonworks.com/training/class/hadoop-essentials/ at no cost to anyone who registers through our Learning Management System (LMS).  We also present it on-demand as an instructor-led course for clients and at key conferences.  The Hadoop Summit in Dublin http://hadoopsummit.…
Career Tip #87: When the VP of HR suggests you finish a task "to avoid jeopardizing your employment"... well... you'd better do it right now.
National pride is almost always directly related to service made to that nation. There is absolutely nothing wrong with expecting rights granted to you by your nation, but sincere accolades and admiration should be awarded to those willing to sacrifice for their country.  Those citizens have earned their rights! While I am by no means in the same class as JFK, I believe his famous, and much more eloquent,…
Career Tip #86: Learn a lesson from Homer; never try.  Now, I'm being a bit facetious here, but I have worked at one of the oldest and largest computer hardware companies in the world where I sat in an employee town hall to hear an HR rep tell several hundred people that we "should be grateful we have a job".  Yes, that was motivating for sure.  Motivated me right out that darn door!…
As I called out in viewing diffs between powerpoint decks (with a little help from adobe), I'm on a hunt for the ultimate approach to allow multiple folks to edit presentations without having them step on each other's toes and that allows for normal source code control to be applied to the "source" file(s).  Using markdown still seems like the best approach, but I still haven't found the perfect solution.  Deckset http://www.decksetapp.…
Since I transitioned to training (and loving it), I've been pondering how best to have multiple people work on presentation decks at different times just like one does with source code.  PowerPoint being a "binary format" doesn't really allow for this.  Yes, there's the built-in Compare & Merge functionality http://www.howtogeek.com/70530/compare-and-merge-different-versions-of-your-presentations-in-powerpoint/, but in practice it really has not worked out well for me.…
I continue to have a blast working at Hortonworks http://hortonworks.com/ and enjoy growing my Hadoop http://hadoop.apache.org/ & Big Data skills while working on interesting projects that can leverage the bleeding edge technology that supports these efforts.  I joined at the beginning of 2013 and have worked in our Professional Services (aka Consulting) team since then.  Again, it has been a blast!…
If you are reading this then the JBOD http://hortonworks.com/blog/why-not-raid-0-its-about-time-and-snowflakes/ talk for Hadoop has probably already sunk in.  Letting the worker nodes have as many spindles as possible is a cornerstone to this strategy whose overall goal is to spread out the I/O and to ensure data locality.  How many spindles per node?  Well,…
Eventually, you will be ready to go well beyond something as simple as the hadoop mini smoke test (VERY mini) to build more confidence in your Hadoop cluster.  This posting is going to introduce Hortonworks' Hive TestBench https://github.com/hortonworks/hive-testbench whose focus is on enabling queries from the Transaction Processing Performance Council (TPC http://www.tpc.org)'s TPC-H http://www.tpc.org/tpch/default.asp and TPC-DS http://www.tpc.org/tpcds/default.…
UPDATE: Please note the warning at the bottom of this post about the inability to consistently start/stop HDP via Ambari on this test cluster, which has since been decommissioned.  There is still tons of good AWS/EC2 information in this post, but as of right now I cannot fully guarantee I've provided anything and everything needed to be completely successful. This is a sister post to installing hdp 2.2 with ambari 2.0 (moving to the azure cloud), but this time using AWS http://aws.amazon.com/' EC2 IaaS servers.…
What a humbling experience to have the opportunity to present at the 2015 Hadoop Summit http://2015.hadoopsummit.org/san-jose/ conference in San Jose.  I've done a decent number of user group presentations over the years, and even have presented Hadoop topics to audiences as big as 500, but this is the first time I have talked at a major industry conference and I had a blast.  It was just cool to have a "presenter" badge and to have my name in all of the conference literature. 1a.JPG  1b.…
I've got a lot of miles running small testing clusters leveraging VirtualBox VMs based on the write up I did a while back at building a virtualized 5-node HDP 2.0 cluster (all within a mac), but as the comments section suggests, it is time to move to the cloud.  HDP is continuing to grow and the 2.2 stack I've installed that way simply has more components than it did several months ago in the 2.0 version and these clusters are starting to crawl.  Innovation waits for no one! BTW,…
Nothing makes my blog posts go faster than just pulling together a few links and calling it done.  That's all I have to do on this topic of connecting DbVisualizer https://www.dbvis.com/ with HiveServer2.  Check out David Streever http://www.linkedin.com/in/davidstreever's wiki page HS2 JDBC Client Jars (Hive Server2) https://streever.atlassian.net/wiki/x/DABD to quickly pull together the needed jar files and then surf on over to cyanfr's How I Connected DBVisualizer 9.2.…
There are surely awesome forums out there and I'm in no way trying to suggest I'm offering anything on the scale of a Stack Overflow http://stackoverflow.com/, but I'm trying something new to see if I can help folks on their Hadoop journey.  I've stood up an Ask Lester wiki page where I'm glad to try to answer the burning questions you might have; or at least give you a pointer to someone or some site that might be able to right your proverbial ship.…
Looks like I almost finished up March without publishing a single blog posting, but my quarterly goal of adding to my Professional Certifications saved the day (or is it month?).  In the spirit of when I obtained hortonworks' apache hadoop administrator certification (finally), I finished up the Hortonworks Certified Apache Hadoop 2.0 Developer http://hortonworks.com/training/hadoop-2-0-developer-certification/ certification today as you can see by the shiny new certificate below.…
Career Tip #85: If you work in an extremely customer-focused organization and are "asked" to join a critical project that senior leadership has eyes-on, don't immediately chime in with the feedback below in front of the whole team when asked to get your hands dirty. Believe it or not, this even coexists with my thoughts on assholes and prima donnas (you need a few).
A modern Hadoop cluster is a beautifully resilient and robust set of machines working together to bring awesome storage capabilities and processing power.  A modern Hadoop cluster (built upon hardened distributions such as HDP http://hortonworks.com/hdp/) is also a very complex set of orchestrated software components that themselves are leveraging even smaller packages/frameworks and ultimately the underlying operating system and hardware itself.…
Ok... kinda cheating here, so please forgive me.  I got an email with the question below from a valued client that was looking at my hadoop mini smoke test (VERY mini) posting and just before I hit the send button on my response I realized (cheating of course!) I could just spin this into a blog posting.  I hope it is useful to someone else as well. Subject: hadoop jar parameters Per the command execution below, is there a standardized way we can reference “-D” parameter values?…
Obviously, a Hadoop cluster is a complicated beast.  Hortonworks' HDP 2.2 http://hortonworks.com/blog/available-now-hdp-2-2/ is a great example of where advanced Hadoop distributions are going and of the multitude of components coming together to enable the Modern Data Architecture http://hortonworks.com/blog/hadoop-ecosystem-modern-data-architecture/.  Executing an exhaustive test of all the components, or at least the ones you are using, is critical,…
All serious Linux administrators have a tool, or a few, they use to run commands en masse as it simply isn't practical to ssh into a bunch of machines.  Even for those of us who regularly work with only a small number of machines, it is no fun either.  There are many tools out there from free utilities to expensive COTS http://en.wikipedia.org/wiki/Commercial_off-the-shelf solutions as well as plenty of RYO http://en.wikipedia.org/wiki/Roll-your-own_cigarette specials.…
In my ongoing attempts of shameless self-promotion, I'm here to ask you to check out my two submissions below and vote for them on the "Community Choice" page of the 2015 Hadoop Summit website http://2015.hadoopsummit.org/brussels/europe-community-choice/ (even if you don't know what the heck Hadoop http://hadoop.apache.org/ or Hive http://hive.apache.org/ is!!). Apache Hive Performance Tuning https://hadoopsummit.uservoice.…
Obviously, Hadoop's holistic set of DataNode worker services along with the NameNode master processes, especially when using HA NN configuration http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.7/bk_system-admin-guide/content/ch_hadoop-ha.html, provide a robust platform for HDFS to survive the failure of a worker node and for the file system to keep on keeping on.  Those worker node failures can take a variety of forms (power supply out, NIC problems, controller is fried,…
These instructions are for "simple" Hadoop clusters that have no sophisticated PAM http://en.wikipedia.org/wiki/Pluggable_authentication_module and/or Kerberos integrations.  They are ideal for the HDP Sandbox http://hortonworks.com/products/hortonworks-sandbox/ or other such "simple" setups like the one called out in building a virtualized 5-node HDP 2.0 cluster (all within a mac)  that rely on "local" users. For all command examples, replace $theNEWusername with the username being created.…
If you really know how many variables are at play in a "typical" Hadoop cluster (including which components to use & what use cases are most important to you) it is easy to see why there aren't too many node sizing guides published out there.  That said, I'll go out on a limb and offer my personal sizing guide for worker nodes. Spec out your worker nodes with multiples of the following logical building block. 1 Hard Drive (1-4TB in size)  –  2 CPU Cores  –  10 GB of RAM With that approach,…
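The building-block arithmetic can be sketched in a few lines. This is a minimal illustration assuming a node is simply N copies of the 1 drive / 2 cores / 10 GB block; the function name `worker_node_spec`, the 2 TB default drive size, and the choice of 12 blocks are hypothetical and not taken from the post.

```python
def worker_node_spec(building_blocks, drive_tb=2):
    """Scale the 1 drive / 2 CPU cores / 10 GB RAM building block."""
    if building_blocks < 1:
        raise ValueError("need at least one building block")
    if not 1 <= drive_tb <= 4:
        raise ValueError("the guide assumes 1-4 TB drives")
    return {
        "drives": building_blocks,
        "raw_storage_tb": building_blocks * drive_tb,
        "cpu_cores": building_blocks * 2,
        "ram_gb": building_blocks * 10,
    }


# Hypothetical example: a worker node built from 12 blocks with 2 TB drives.
print(worker_node_spec(12))
```

For 12 blocks this works out to 12 drives, 24 cores, and 120 GB of RAM, with 24 TB of raw storage at the assumed drive size.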
Ok... I'll say it... the technology patent process in the US must officially be broken -- if not simply ridiculous.  Proof?  Just check out US20130275363 http://www.google.com/patents/US20130275363 which was awarded for the IDEA of a "meta-data driven data ingestion using mapreduce framework".  Seriously?  Maybe I should file (and probably get awarded) a patent for the IDEA "get out of bed on workdays so you can go to your job" which is just about as obvious.…
When a language, framework, tool, or product changes a basic datatype behavior, it is surely time to see how this might affect you.  Hive did such a thing when version 0.13 was introduced.  The issue is documented quite well on the Hive wiki write up for the Decimal datatype https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Decimals, but I give credit to Kevin Risden http://kevin.risden.…
Hive has come a long way in the past couple of years – especially due to the Stinger Initiative that culminated in Hive 0.13 http://hortonworks.com/blog/announcing-apache-hive-0-13-completion-stinger-initiative/ which is already available in some of the leading Hadoop Distribution Components & Versions.  As the pretty picture below shows, the Hive community hasn't stopped there and has publicly declared yet another three-release initiative http://hortonworks.…
Let me start this blog post by clearly saying that I'm not suggesting that you should not stand up a Hadoop cluster if you have only allocated two hosts to serve as master nodes as it makes great sense that you should get started on whatever you have!!  I am, however, saying that you will need to violate the mantra of not collocating master & worker daemons on the same host and/or you will have to live without the evolving HA capabilities inherent to many of the key master processes.…
2014 has been a very cool year for me professionally and a very busy one as well.  I've had it on my plate to sit down and take the test for the Hadoop 2.0 Administrator Certification http://hortonworks.com/training/hadoop-2-administration-certification/ for a long time and I've simply struggled to find the time. Fortunately, I finally did and look at the shiny new "certificate" I got! If interested in my personal feelings about Professional Certifications,…
The component referred to as "XA Secure (Apache Argus)" finally settled down with its finalized open-source name and is now referred to as Apache Ranger.  I've had some very positive hands-on experiences with Ranger since this blog posting was written and I am very enthusiastic about its place in the Hadoop security stack.  Check out the Apache project http://ranger.incubator.apache.org/ or Hortonworks product page http://hortonworks.com/hadoop/ranger/ for more information on Ranger.…
These corrections https://martin.atlassian.net/wiki/pages/diffpagesbyversion.action?pageId=27885570&selectedPageVersions=10&selectedPageVersions=9 were made on 9/2/2015 to this blog posting. So... time to eat some crow.…
For the 50-ish folks that made it out to the 7/28 Atlanta .NET User Group meeting http://www.meetup.com/Atlanta-Net-User-Group/events/193980882/, thanks for making me feel so welcome.  I know we went from 0 to 100 in an incredibly short period of time and were only able to go an inch deep and a quarter mile wide in this open-source collection of technologies, but I'm hopeful my title was appropriate and that I was able to "demystify" Hadoop some for everyone. As promised,…
This blog post is for anyone who would like some help with creating/executing a simple MapReduce job with C# – specifically for use with HDP for Windows http://hortonworks.com/blog/hdp-2-0-windows-ga/.  For my Hadoop instance, I'm using the virtual machine I had fun with during my installing hdp on windows (and then running something on it) effort.  As this non-JVM language will ultimately require the use of Hadoop Streaming http://hadoop.apache.org/docs/r1.2.1/streaming.html,…
As usual, I'm running a bit behind on my extracurricular activities.  What is it this time?  Well, I'm on the hook to deliver a "Hadoop Demystified" preso/demo to the Atlanta .NET User Group http://www.meetup.com/Atlanta-Net-User-Group/ in less than a week as identified here http://www.meetup.com/Atlanta-Net-User-Group/events/193980882/.  Truth is... I've delivered this before, but this time the difference will be that I want to showcase HDP on Windows http://hortonworks.…
Many people have heard of the "small files" concern with HDFS.  Most think it is related to the Namenode (NN) and its memory utilization, but the NN really doesn't care much if the files it is managing are big or small -- it really is concerned about how many there are. This topic is fairly detailed and better described via sources such as this HDFS Scalability whitepaper https://www.usenix.org/legacy/publications/login/2010-04/openpdfs/shvachko.pdf,…
I finally traded in my old 2002 Volvo S80 http://www.volvocars.com/us/all-cars/volvo-s80/ to get the family a new 2014 Volvo XC90 http://www.volvocars.com/us/all-cars/volvo-xc90/. It is not for me; I'm the "dad", so I get the hand-me-down 2012 Honda CR-V from my wife.  I thought I'd share how Volvo's "don't fix it if it ain't broke" mindset aligns with a couple of my core beliefs: "Better is the Enemy of Done" and "Consistency is King". Like anything, a picture is worth a thousand words.…
If you find yourself needing to set up Hortonworks Data Platform (HDP) with Ambari in an environment where users and groups need to be pre-provisioned instead of simply created during the install process, then don't fret as Ambari has got you covered.  This write-up piggybacks the HDP Documentation http://docs.hortonworks.com/ site and uses HDP 2.1.2 http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.2/index.html along with Ambari 1.5.1 http://docs.hortonworks.com/HDPDocuments/Ambari-1.5.1.…
As seen in building a virtualized 5-node HDP 2.0 cluster (all within a mac) it is (relatively) easy to build a full-featured multi-node Hadoop cluster using virtualization technologies such as VirtualBox.  Obviously, I choose to install Hortonworks Data Platform (HDP) when I'm doing such an activity and I also leverage Ambari.  With my setup all of my nodes have access to the internet which lets each connect to the Hortonworks Public Repo when it needs it,…
This post represents the completion of the trilogy started in use mapreduce to calculate salary statistics for georgia educators (first of a three-part series) and use pig to calculate salary statistics for georgia educators (second of a three-part series).  As the name suggests, we're going to try to solve a simple question (specifically the one listed at Simple Open Georgia Use Case) with Apache Hive http://hive.apache.org/ and compare/contrast a bit on this approach vs.…
In this, the second installment of a three-part series, I am going to show how to use Apache Pig http://pig.apache.org/ to solve the same Simple Open Georgia Use Case that we did in use mapreduce to calculate salary statistics for georgia educators (first of a three-part series), but this time I'll do it with a lot less code.  Pig is a great "data pipelining" technology and our simple need to parse, filter, calculate statistics, sort,…
This is the first of a three-part series on showing alternative Hadoop & Big Data tools being utilized for Open Georgia Analysis.  The data we are working against looks like the following which is an include of the Format & Sample Data for Open Georgia wiki page. In this first installment, let's jump right in where Hadoop began: MapReduce.  After you visit Preparing Open Georgia Test Data and get some test data loaded into HDFS,…
Every time I go to find a simple matrix of the two major open-source Hadoop distributions' component version list I simply can't find one.  So... I created my own!! The good news is that this blog posting is simply including the Hadoop Distribution Components & Versions wiki page so whenever it gets updated, this gets updated.
My alter-ego's (jazzyearl) Twitter feed https://twitter.com/EarlsOfWisdom was a bit prolific a while back. He (me!) offered up a few more last week, too. I hope they at least tickled your funny bones.
I needed to manually install Hue on my little cluster I previously documented in Build a Virtualized 5-Node Hadoop 2.0 Cluster so I thought I'd document it as I went just in case it worked (and if there were any tweaks from the documentation).  The Hortonworks Doc site URL for the instructions I used is http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.0/bk_installing_manually_book/content/rpm-chap-hue.html …
As a follow-up to create and share a pig udf (anyone can do it) I thought I'd post a similarly focused write-up on how you can put your custom Hive UDF https://cwiki.apache.org/confluence/x/MoOhAQ jars on HDFS to let all users utilize the functions you create.  As detailed in HIVE-6380 https://issues.apache.org/jira/browse/HIVE-6380, if you are already on Hive 0.13 (HDP 2.1) then notice the one-liner way to do all of this at the bottom of that last Hive wiki link.  As for me,…
The Apache Pig project's User Defined Functions http://pig.apache.org/docs/r0.12.0/udf.html gives a pretty good overview of how to create a UDF.  In fact, I stole my simple UDF from there.  For Pig UDFs the obligatory "Hello World" program is actually a "Convert to Upper Case" function.  For this effort, I'm using the Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox/ (version 2.0).  Once you have that setup operational,…
This write-up was for an issue, and resolution, on an HDP 1.3.2 http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/index.html installation.  Default properties (and behavior!) can surely change in future releases, but the general message should be relevant regardless of which version of Hadoop you are using. One of my clients came to me with a concern about Oozie apparently running their Hive script much slower than when they kicked it off via the CLI with a hive -f SCRIPT_FILE.hql command.…
See the comments section which identifies a better way to do this using FixedWidthLoader http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/FixedWidthLoader.html. I was talking with a client earlier this week that is using the Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox/ to jumpstart his Hadoop learning (hey, that's exactly why we put it out there).  Thanks to all the Sandbox tutorials http://hortonworks.…
What a great (and exhausting!) week at the Hortonworks Palo Alto office https://www.google.com/maps/place/Hortonworks/@37.4347263,-122.1087409,17z/data=!3m1!4b1!4m2!3m1!1s0x808fb653780f055b:0x51e9df38065b91ac.  I learned a lot; and more importantly I met some great people.  I'm so excited to be on this journey and to be at Hortonworks!!…
First up, this post has a very limited audience thus I won't do my normal promotional blast to my work-oriented LinkedIn http://www.linkedin.com/in/lestermartin and Facebook http://facebook.com/lester.martin.professional profiles.  This is something only those who use Confluence https://www.atlassian.com/software/confluence could appreciate. The is there a way to resize the columns of a table on Confluence? https://answers.atlassian.…
apache-ambari-project.jpg http://ambari.apache.org/ The Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox/ is a great way to get introduced to Hadoop and to "get up and running in 15 minutes", but at some point in your platform architect role (see disruptive possibilities (the rise of platform architecture) for more details) you will want to build out your first multi-node cluster.  Using Ambari http://hortonworks.…
This blog posting's content was originally on the Build a Virtualized 5-Node Hadoop 2.0 Cluster wiki page, but it just made sense to refactor it into a blog posting based on the short shelf-life it has due to changes in the ever-evolving HDP stack. This write-up is designed to capture the steps required to stand up a 5-node HDP2 (Hortonworks Data Platform) http://hortonworks.com/products/hdp-2/ Hadoop 2.0/YARN https://hadoop.apache.org/docs/current2/hadoop-yarn/hadoop-yarn-site/YARN.…
Much like in too big to ignore (too boring to read), Disruptive Possibilities http://www.amazon.com/Disruptive-Possibilities-Data-Changes-Everything-ebook/dp/B00CLH387W/ (Kindle edition is FREE) kicks off with discussions about how Big Data will change the world.  Jeffrey Needham surmises the following: DisruptivePossibilities.jpg Big data will bring disruptive changes to organizations and vendors, and will reach far beyond networks of friends to the social network that encompasses the planet.…
foxtrotJava.gif
I was lucky enough to make it to Hadoop World http://strataconf.com/stratany2013/ in New York City this year.  Thanks to the Cloudera http://www.cloudera.com team for giving me a pass to the event and to my boss for picking up the travel tab (he hasn't seen my expense report yet -- those rooms aren't cheap in Midtown!).  The whole thing reminded me of JavaOne the first time I went back in 2001.  So much excitement around new technologies.   In fact,…
Yep, "staying on message" and "taking one for the team" are just part of leadership. That said, being the messenger doesn't feel that good when King Leonidas kicks you into a bottomless pit. Especially when it wasn't your fault. sparta-kick.png Career Tip #84: Suck it up!
too_big_to_ignore.jpg I guess I should be fair... Phil Simon's too BIG to IGNORE http://www.philsimon.com/books/too-big-to-ignore/ is really NOT a book for technologists, much less those well on their way on their own "Big Data" journeys.  It is clearly labeled as a primer for chief executives, company owners, industry leaders, and business professionals.  Those surely must be the folks that rated this book 4.7 (out of 5.0) "stars" on its Amazon listing http://www.amazon.…
The (in)famous "cat herding" EDS video describes a big chunk of my day and still cracks me up every time I see it.  If it is new to you, enjoy!
My heart sank today at work -- but first some background.  We are a large organization with teams spanning technologies from the mainframe to mainstream Java and .NET frameworks; not to mention a deep investment in C/C++.  With that, it is not hard to imagine we have a variety of software development maturity levels across all of our teams.  Additionally, our big push (some here call it an experiment!) to agile has taken many different paths due to our very decentralized/autonomous model.…
taking sides (finally)
When you live in Texas you inevitably have to "take sides" on the UT http://www.utexas.edu/ or TAMU http://www.tamu.edu/ rivalry – even if you never went to either of these schools ('93 UNT BCIS http://www.unt.edu/majors/ubcis.htm and damn proud of it!; we've got a heckofu famous alumni http://www.unt.edu/famous-alums.htm list, too).  Where am I on that one?  Well, it will probably anger many of my family, friends & colleagues (including my younger brother), but I have to answer with Hook 'em,…
I am transitioning teams at my employer and the group that I am joining the leadership team of is a big fan of Dean Leffingwell's Scaled Agile Framework http://scaledagileframework.com/ (pronounced SAFe).  I'll have to admit that it is a bit new to me and I wish there was more information available than just from the creator of the framework himself (his framework/product site, his blog and his books) and a single vendor (i.e. Rally http://www.rallydev.com/toolkits/scale-agile-safely-rally).…
Are you asking yourself what is a data scientist?  If so, check out IBM's definition http://www-01.ibm.com/software/data/infosphere/data-scientist/ which isn't too bad.   What am I saying – it sounds awesome!  It's the next great tech-oriented "artist" role out there – what's not cool about that? When I was taking Cloudera's Hadoop Administration http://university.cloudera.com/training/apache_hadoop/administrator.html course (BTW,…
Several years ago Chris Potts bookCovers.jpg wrote his two-part Information Technology and Enterprise Architecture story (or was it a warning?) in FruITion and RecrEAtion.  These stories are intended to express Chris' belief of how IT/EA teams should be fully-engaged in the Strategy of the companies they belong to, and not just be a "cost center".  He uses a novel format in both books as well as follows a character from the first novel into the second.…
Yarn.jpg I was just explaining to a colleague today how Hadoop 2.0 (aka YARN, which stands for Yet Another Resource Negotiator) differs from Hadoop 1.0.  Today's "core Hadoop" consists of HDFS and MapReduce and each has its own master & worker daemon processes.  Specifically, NameNode & DataNode for HDFS and JobTracker & TaskTracker for MapReduce.  This itself makes sense as HDFS and MapReduce are focused on two different things. HDFS offers redundant,…
While working my way through Eric Sammer's Hadoop Operations http://www.amazon.com/Hadoop-Operations-Eric-Sammer/dp/1449327052/ book I came across this call-out from Chapter 9. The propensity for rebooting hosts or restarting daemons without any form of investigation is the opposite of everything discussed thus far.  This particular form of disease was born out of a different incarnation of the 80/20 rule,…
My current employer has been doing some decent renovations at our offices and I stumbled into a "library" on one of the newly jazzed up floors.  You know, one of those rooms that has a bunch of bookshelves where everyone puts all of the old & crappy books they just don't want anymore.  The last time I was in this room I saw a book with a catchy title: The No Asshole Rule http://www.amazon.com/Asshole-Rule-Civilized-Workplace-Surviving/dp/0446698202/.…
Hooray for me; I earned my CCDH (Cloudera Certified Developer for Apache Hadoop http://university.cloudera.com/certification/CCDH.html on CDH4) credential today!!  You're probably asking, "so what?" and are wondering what that really means.  I guess before I answer that, I have to give you my personal opinions (hey, it is MY blog) on certification as a whole. 01_Hadoop_full.jpg Certification in the technology field is a tricky one.…
Career Tip #83: If you accept a meeting, attend it.  Hey, things come up and plans change, but isn't the person you are standing up as "important" as YOU?  A quick text, email or call would go a long way. Nothing like getting stood up twice in the last couple of days!!  :-(
Over the years I’ve been lucky enough to be a reviewer on a few books.  For Manning http://www.manning.com, I was able to do this on the EJB Cookbook http://www.amazon.com/Ejb-Cookbook-Benjamin-G-Sullins/dp/B005Q8F7SG/, Spring in Action http://www.amazon.com/Spring-Action-Craig-Walls/dp/1932394354/ and Portlets in Action http://www.amazon.com/Portlets-Action-Ashish-Sarin/dp/1935182544/.  Recently, PACKT Publishing http://www.packtpub.…
So… is that the only song Cracker has?  Seriously, I did read NoSQL Distilled http://www.amazon.com/NoSQL-Distilled-Emerging-Polyglot-Persistence/dp/0321826620/ recently and wanted to share my review of it.  I've posted a YouTube video as well as put the presentation up on SlideShare at https://www.slideshare.net/lestermartin/nosql-distilled-book-review. As the description says on both sites, this book is for those new,…
At one of my prior employers we had a dev manager (let's call him Rico to protect the guilty) who has been around the block for a while.  Rico is savvy, Rico is suave and Rico will be a "manager" (maybe not so much of a leader...) at his current company a long, long time.  It is not that he solves that many problems; it is that he knows how to remain stain-free. Furthermore, you'll never hear Rico say anything in anger -- heck,…
Long before I became a development manager I truly loved the beginning of a software project. I enjoyed that short period where a few folks started kicking around the requirements and ideas & plans started brewing on how we would design and implement a solution. The next steps of actually starting to build something that aligns to the plans (including the learnings from what the plans didn’t tell us) are even more enjoyable. From experience,…
While enjoying a little time off this holiday season I was digging through my old Army stuff and ran across the “Rangers Handbook http://www.benning.army.mil/infantry/rtb/content/PDF/2011%20RHB%20Final%20Revised%2002-11-2011.pdf” (MCOE SH 21-76) which caused me to pause and reminisce for a little while.  I must admit that I never earned a Ranger tab (I wasn’t even in the “combat arms” branches http://en.wikipedia.org/wiki/United_States_Army_branch_insignia),…
There are a ton of different beliefs of what it takes to be successful in the software development profession.  Some folks out there would tell you to become the very best “Xyz” developer you can possibly be.  While I’m not against being very competent at any particular skill, I’d sure advocate being very good at many different skills as well.   Scott Ambler declared this kind of person a Generalizing Specialist years ago and I wish everyone would read his short essay http://www.agilemodeling.…
Back in late 2010, the Harvard Business Review (HBR) posted Who Should be Your Chief Collaboration Officer (CCO)? http://blogs.hbr.org/cs/2010/10/who_should_be_your_chief_colla.html  The authors called for identifying "someone to look after the whole, by taking a holistic view of what is needed to get employees to work across silos".  They then spent the rest of the blog tossing out who might be the best person (or role) to take on these additional responsibilities (i.e.…
Career Tip #82: Don’t EVER send an email after 9pm with the word “manifesto” ANYWHERE in it.  
OK… I’ll admit it before you read too much. This posting is really a bit of a rant, but I do try to wrap it up with something positive; maybe even insightful. I was in a drive-thru the other day and the personalized license plate on the giant SUV in front of me read, “LUKYONE”. This “lucky” person decided the world was his oyster and proceeded to dump out his coffee as he reached the speaker station. Of course, coffee splashed everywhere. Fortunately, even a bit hit his own vehicle.…
I was recently discussing the old Microsoft personas of Mort, Elvis and Einstein with some folks at work and was shocked that most had never heard of them.  You can easily google these three magical names and find several articles such as this one http://www.codinghorror.com/blog/2007/11/mort-elvis-einstein-and-you.html, here http://blogs.msdn.com/b/ericwhite/archive/2006/05/11/595693.aspx and yet another one http://de.wikipedia.org/wiki/Mort,_Elvis,_Einstein (well… if you can read German).…
My prior employer had a company-wide MediaWiki http://www.mediawiki.org/wiki/MediaWiki instance along with other social/collaborative tools (blogs, forums, etc) in addition to the expected deployment of SharePoint.  To help span the various notifications, an internally-developed aggregator was created to roll up the various activity feeds that are produced into a social-oriented view.  At my current organization, this level of open authoring tooling is not as prevalent.…
I came across a great quote the other day: “Give as few orders as possible. Once you’ve given orders on a subject, you must always give orders on that subject.”  Brownie points to anyone who can source that quote (hint: it is from a SciFi book). I’m not talking about Saint Philip Neri’s (paraphrased) quote of “he who wishes to be perfectly obeyed should give few orders” which itself was directed at government and its influence on citizens.…