SAN FRANCISCO -- VMware administrators who experimented with running big data workloads lived to tell about their experiences here this week, while vendors explained how big data principles could actually improve day-to-day IT operations.
One organization that ran big data workloads in the cloud reported that doing so quickly became cost-prohibitive, and that it is much cheaper and easier to run them in-house on -- guess what? -- VMware Inc.'s platforms.
Identified Inc., a software firm that identifies recruiting candidates from social media sites such as LinkedIn and Facebook, said that building its own Apache Hadoop cluster on VMware infrastructure paid for itself in less than two months, compared with using Amazon Web Services (AWS) Elastic MapReduce.
For about $80,000, Identified built a 225 TB Hadoop cluster on eight of Super Micro Computer Inc.'s FatTwin servers, which AWS's calculator said would have cost $43,000 per month to run.
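The payback claim can be checked with simple arithmetic. The sketch below uses only the two figures quoted above and assumes, for illustration, that the AWS estimate is the only cost avoided; it ignores power, hosting and staff time, which the article does not quantify. The function name `payback_months` is hypothetical.

```python
# Figures quoted in the article.
CLUSTER_COST_USD = 80_000       # one-time cost of the in-house Hadoop cluster
AWS_MONTHLY_COST_USD = 43_000   # AWS calculator estimate for a comparable cluster


def payback_months(upfront_cost: float, monthly_cost_avoided: float) -> float:
    """Months until the avoided monthly spend covers the upfront cost.

    Assumption: operating costs of the in-house cluster are not modeled.
    """
    if monthly_cost_avoided <= 0:
        raise ValueError("monthly_cost_avoided must be positive")
    return upfront_cost / monthly_cost_avoided


months = payback_months(CLUSTER_COST_USD, AWS_MONTHLY_COST_USD)
print(f"Payback period: {months:.2f} months")
```

Under those assumptions the cluster pays for itself in a little under two months, which is consistent with Identified's figure.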
"AWS is great if you're just starting out and testing out Hadoop, but anyone with half a brain can see that the ROI doesn't make any sense," said Sasha Kipervarg, Identified's director of SaaS operations, speaking at a VMworld 2013 session about big data extensions.
As a startup, Identified began operations in AWS about a year ago, but quickly started seeing AWS bills of $40,000 per month for about 200 virtual machines. It decided to bring most of its IT back in-house, but run occasional Hadoop jobs in the cloud.
Then, "at some point, developers needed to use Hadoop all the time," Kipervarg said, and the firm once again saw its AWS bill spike.
However, building and managing your own Hadoop cluster isn't without its challenges. "We had Hadoop developers, but they didn't know anything about the services," Kipervarg said, such as how to configure the various Hadoop components.
To that end, Kipervarg adopted Project Serengeti, an open source project led by VMware that automates the configuration and setup of a Hadoop cluster running on VMware infrastructure. After downloading the Open Virtualization Format file and reading the documentation, the firm deployed a 30-node cluster in about 10 minutes.
VMware incorporated Project Serengeti as one component of the Big Data Extensions (BDE) feature included in the upcoming vSphere and vCloud 5.5 release, which it announced this week. Other features include the Virtual Hadoop Manager, which automates scale-up and scale-down of the Hadoop cluster, plus Hadoop Virtualization Extensions for making Hadoop more virtualization-aware.
Solving for storage
Another feature of BDE is support for separating Hadoop compute and data nodes and running them on a combination of physical and virtual hardware.
For availability and scalability reasons, it makes sense to run Hadoop components such as the NameNode, JobTracker and TaskTracker as virtual machines, said Jayanth Gummaraju, a VMware software engineer, but running compute and data on the same node, as is usually the case, "limits the scalability of the system."
Separating compute and data nodes also allowed Identified to make better use of its expensive flash SAN storage resources, Kipervarg said. Storing the data on local spinning disks relieved pressure on its SAN without negatively affecting performance. "For sequential data, spinning disk is about equal to the performance of flash," he said.
Another VMware Serengeti customer, shipping giant FedEx, leveraged an existing scale-out network-attached storage system from Isilon Systems for its big data deployment.
"We aim for a common infrastructure for different groups within FedEx," said Chris Greer, enterprise architect for FedEx Services, during a session on virtualizing big data, Hadoop, high-performance computing and cloud-scale applications. "We also require a known security model and a trusted auditable platform."
What's in it for IT?
Elsewhere, analytics principles popularized by big data have trickled down to IT operations. VMware's new vCenter Log Insight helps VMware administrators pinpoint key information affecting the performance and availability of their systems.
During VMworld this week, Joe Baguley, VMware chief technology officer (CTO) for EMEA, showed how Log Insight identified 15 relevant events from a pool of 65 million.
"I don't know about you, but I wouldn't want to grep and search through 65 million events," Baguley said.
Indeed, administrators are increasingly asking for insights into billions and billions of data points, said Alan Conley, CTO at monitoring firm Zenoss Inc., which responded by re-architecting the back end of its Zenoss Service Dynamics with big data mainstays Hadoop, HBase and OpenTSDB for improved scalability and performance.
"People are asking for a greater number of objects, with greater frequency and with a longer retention time," Conley said. "When you multiply that out, the amount of data that people want to retain is growing exponentially."
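Conley's "multiply that out" can be made concrete with a rough sketch. All of the numbers below are hypothetical and chosen only for illustration; none come from Zenoss or its customers. The point is that retained data is the product of three factors, so doubling each one multiplies the total by eight.

```python
def retained_data_points(objects: int, samples_per_day: int, retention_days: int) -> int:
    """Data points held at any one time: objects x sampling frequency x retention."""
    return objects * samples_per_day * retention_days


# Hypothetical baseline: 10,000 monitored objects, one sample every
# five minutes (288 per day), kept for 30 days.
baseline = retained_data_points(10_000, 288, 30)

# Hypothetical request: twice the objects, twice the frequency, twice the retention.
grown = retained_data_points(20_000, 576, 60)

print(f"Baseline: {baseline:,} data points")
print(f"Grown:    {grown:,} data points")
print(f"Growth factor: {grown // baseline}x")
```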