Bossie Awards 2016: The best open source big data tools

The best open source big data tools

Big data, fast data, data in tables, data that -- try as it might -- simply can’t maintain relationships. The problems with processing big piles of data are many and varied, and no one tool can handle it all -- not even Spark. Among this year's Bossies in big data, you'll find the newest, best approaches to leveraging large clusters of machines for indexing and searching, graph processing, stream processing, structured queries, distributed OLAP, and machine learning. Because many processors -- and lots of RAM -- make light work.


Spark

Spark is the in-memory distributed processing framework, written in Scala, that ate up the big data world. With the 2.0 release, its continued ascendency seems guaranteed. Aside from shoring up features like its SQL implementation and making strides in performance, Spark 2.0 goes further in standardizing on DataFrames, the new Structured Streaming APIs, and the all-new and improved SparkSession. All of these bring Spark programmers some clarity and relief, but Structured Streaming may have the biggest impact.

Marking a shift from the batched processing of RDDs to the concept of an unbounded DataFrame, Structured Streaming will make certain types of streaming scenarios (such as change-data-capture and update-in-place) much easier to implement -- and allow windowing on time columns in the DataFrame itself instead of on the time new events enter the streaming pipeline. The lack of such event-time windowing has been a long-running thorn in Spark Streaming’s side, especially in comparison to competitors like Apache Flink and Apache Beam. Spark 2.0 heals the wound. If you haven’t learned Spark yet, it is well past time.
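
To make the windowing distinction concrete, here is a minimal sketch in plain Python. It is not the Spark API; the function names, the 10-minute window width, and the sample events are all assumptions made for illustration. Each event is bucketed by the timestamp it carries, so a late arrival still lands in the window it belongs to.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def window_start(event_time, width):
    """Floor an event timestamp to the start of its fixed-width window."""
    whole_windows = (event_time - EPOCH) // width  # timedelta // timedelta -> int
    return EPOCH + whole_windows * width

def count_by_event_window(events, width=timedelta(minutes=10)):
    """Count events per window of their own timestamp, ignoring arrival order."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        counts[window_start(event_time, width)] += 1
    return dict(counts)

# Hypothetical events, listed in arrival order; the second one arrived late.
events = [
    (datetime(2016, 9, 1, 12, 14), "click"),
    (datetime(2016, 9, 1, 12, 3), "click"),  # late, but belongs to the 12:00 window
    (datetime(2016, 9, 1, 12, 17), "view"),
]
result = count_by_event_window(events)
```

Windowing on arrival time would have lumped the late event in with its neighbors. Keying on the time column puts it back where it belongs.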

-- Andrew C. Oliver

Beam

Google’s Beam -- an Apache Incubator project -- gives us a way to avoid rewriting code every time our processing engine changes. At the moment it appears that Spark may be our programming model of the future, but what if it isn’t? Moreover, if you’re interested in some of the extended features and performance of Google Cloud Dataflow, then you can write your code in Beam and run it on Dataflow, Spark, or even Flink if that’s your thing.

We like the idea of write-once-run-anywhere so much that no matter how many times we’ve been burned (looking at you, Scott McNealy), we’ll buy it. While Beam doesn’t support developer features like a REPL, it does give you a great way to future-proof your core distributed computing logic and run it on the engine of the day.

-- Andrew C. Oliver

TensorFlow

Google open-sourced the secret sauce to some of its machine learning wizardry. Whether you’re trying to do character recognition, image recognition, natural language processing, or some other kind of complicated machine learning application, TensorFlow may be the first answer you seek.

TensorFlow is written in C++ but supports coding in Python. Moreover, it finally gives us a convenient way to run both distributed code and optimized parallel code on GPUs and CPUs. This is going to be the next big big data tool we won’t stop talking about.

-- Andrew C. Oliver

Solr

The choice of Hadoop heavyweights like Hortonworks, Cloudera, and MapR, Apache Solr brings trusted and mature search engine technology to the enterprise. Solr is based on the Apache Lucene engine, and the two projects share many committers. You can find Solr behind the scenes at businesses such as Instagram, Zappos, Comcast, and DuckDuckGo.

Solr includes SolrCloud, which leverages Apache ZooKeeper to create a scalable, distributed search and indexing solution that is highly resistant to common problems with distributed systems such as network split-brain. Along with the reliability, SolrCloud is able to scale up or down as required, and it's mature enough to handle high query volumes across billions of documents.

-- Ian Pointer

Elasticsearch

Elasticsearch, also based on the Apache Lucene engine, is an open source distributed search engine that focuses on modern concepts like REST APIs and JSON documents. Its approach to scaling makes it easy to take Elasticsearch clusters from gigabytes to petabytes of data with low operational overhead.
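
As a rough illustration of the REST-plus-JSON style, the sketch below builds (but does not send) a search request using only the Python standard library. The host, the `logs` index, and the `message` field are hypothetical. The `_search` endpoint and the `match` query are part of Elasticsearch's query DSL.

```python
import json
from urllib.request import Request

def build_search_request(base_url, index, field, text):
    """Build an HTTP request for a full-text match query; nothing is sent."""
    body = {"query": {"match": {field: text}}}
    return Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical cluster address, index name, and field name.
req = build_search_request("http://localhost:9200", "logs", "message", "timeout")
```

Passing `req` to `urllib.request.urlopen` would send the query to a running cluster, which would return its hits as a JSON document.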

As part of the ELK stack (Elasticsearch, Logstash, and Kibana, all developed by Elasticsearch’s creators, Elastic), Elasticsearch has found its killer app as an open source Splunk replacement for log analysis. Companies like Netflix, Facebook, Microsoft, and LinkedIn run large Elasticsearch clusters for their logging infrastructure. Furthermore, the ELK stack is finding its way into other domains, such as fraud detection and domain-specific business analytics, spreading the use of Elasticsearch throughout the enterprise.

-- Ian Pointer

SlamData

Coming around to SlamData was a long trip for me. Why would you use MongoDB as your analytics solution? That’s an operational database. However, as SlamData’s Jeff Carr walked me through it, it didn’t seem so insane. There are a lot of new companies and young developers bred on MongoDB. If you have a MongoDB data store and need to run basic analytics, are you going to create a whole Hadoop or other infrastructure for reporting?

That’s a lot of ETL for reporting on one data store! Far easier and saner is to report straight from a replica. SlamData has a SQL-based engine that talks natively to MongoDB. Unlike MongoDB’s own solution, SlamData is not sucking all the data into PostgreSQL and calling it a BI connector. Now that the core technology is open source, I think we can expect to see more adoption as the company focuses on the top end.

-- Andrew C. Oliver

Impala

Apache Impala is Cloudera’s engine for SQL on Hadoop. If you’re using Hive, Impala is an easy way to up your query performance without rethinking how you do everything. A row-based, distributed, massively parallel processing system, Impala is more mature and more thoroughly thought out than the Hive on Spark combo. Even without much tuning, Impala will improve your performance, and I’d wager the results will be better than what you’re likely to experience with Tez given the same level of effort. If you need SQL over some files that you have on HDFS, then Impala might be your best bet.

-- Andrew C. Oliver

Kylin

If you are going to do n-dimensional cube analysis and you want to do so on a modern big data framework, then Kylin is your game. If you’ve never heard of an OLAP cube, then consider a couple of tables in an RDBMS where a one-to-many relationship exists, but there is a calculated field that requires fields from both sides. You could query and calculate this in SQL, but gosh, that’s slow. What about when we have a few more relationships and a couple of calculated fields?

Instead of two flat tables, imagine them as two sides of a cube made up of a number of blocks, with each block holding a (potentially precalculated) value. You can even have n dimensions -- still called a cube but with many more sides than your literal cube. Kylin is certainly not the first implementation of distributed OLAP, but it is one of the first built on modern technology. It's one of maybe two that you can download and install on your favorite cloud provider today.
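
The precalculation idea can be sketched in a few lines of plain Python. This is a toy, not how Kylin builds or stores its cubes; the `build_cube` helper and the sales rows are invented for illustration. It sums one measure for every combination of dimensions up front, so a query becomes a lookup instead of a scan.

```python
from collections import defaultdict
from itertools import combinations

def build_cube(rows, dimensions, measure):
    """Precompute the sum of `measure` for every combination of dimensions."""
    cube = defaultdict(float)
    for row in rows:
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                # A cell is identified by the dimensions it keeps and their values.
                key = tuple((d, row[d]) for d in dims)
                cube[key] += row[measure]
    return dict(cube)

# Hypothetical fact rows.
sales = [
    {"region": "east", "product": "widget", "year": 2015, "amount": 100.0},
    {"region": "east", "product": "gadget", "year": 2016, "amount": 50.0},
    {"region": "west", "product": "widget", "year": 2016, "amount": 75.0},
]
cube = build_cube(sales, ("region", "product", "year"), "amount")

# Each query is now a dictionary lookup on a precalculated block.
east_total = cube[(("region", "east"),)]
widgets_2016 = cube[(("product", "widget"), ("year", 2016))]
grand_total = cube[()]
```

The catch is that the number of cells grows quickly with the number of dimensions, which is why the work of building a cube gets distributed across a cluster.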

-- Andrew C. Oliver

Kafka

Kafka is pretty much the standard for distributed publish and subscribe. Will it ever reach 1.0? Who knows, but it's already used in some of the largest systems in the world. Message delivery in Kafka is reliable, as in other messaging systems. Unlike most earlier such systems, though, the commit log is distributed. Moreover, Kafka partitions streams to support a high data load and a large number of clients. Despite how impressive all of these capabilities are, Kafka is surprisingly easy to install and configure -- an exceptional exception to the rule in big data and messaging.
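
For a feel of how a partitioned log supports many clients, here is a small in-memory sketch. It is not Kafka's client API; the `PartitionedLog` class is hypothetical, and CRC32 stands in for whatever hash a real partitioner uses. Records with the same key go to the same append-only partition, and each consumer group tracks its own read offset.

```python
import zlib
from collections import defaultdict

class PartitionedLog:
    """Toy append-only log split into partitions, with per-group read offsets."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self.offsets = defaultdict(int)  # (group, partition) -> next offset to read

    def publish(self, key, value):
        # Same key always hashes to the same partition, preserving per-key order.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition

    def poll(self, group, partition):
        # Reading never removes records; it only advances this group's offset.
        start = self.offsets[(group, partition)]
        records = self.partitions[partition][start:]
        self.offsets[(group, partition)] = start + len(records)
        return records

log = PartitionedLog(num_partitions=3)
p = log.publish("user-42", "login")
log.publish("user-42", "logout")

first = log.poll("billing", p)   # both records, in publish order
again = log.poll("billing", p)   # nothing new for this group
audit = log.poll("audit", p)     # a second group reads from the start
```

Because consuming is just moving an offset, adding another subscriber costs the log nothing, and spreading partitions across machines spreads the load.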

-- Andrew C. Oliver

StreamSets

You have round-shaped data over here that needs to go in a square hole over there. Maybe that data is in a file (like web logs) or maybe you’re streaming it in via Kafka. There are a number of ways to go about this, but I had an easy time getting StreamSets to do exactly what I wanted it to do, and it actually seems more complete than other solutions (cough NiFi cough). There's a robust and growing list of connectors (HDFS, Hive, Kafka, Kinesis), a REST API, and a pretty GUI to monitor your flows. It's like they wanted to actually solve this problem!

-- Andrew C. Oliver

Titan

Graph databases were supposed to set the world on fire until people started to realize that doing really useful graph work doesn’t necessarily mean having to store things that way. Titan sort of splits the difference. You have a sophisticated graph database with all of the fixings and built with pluggable storage, but essentially pointed at highly distributable column family databases.

Compared to other graph databases, Titan scales out rather than up. Compared to strictly graph analytics frameworks, Titan can provide better performance than, say, Giraph, without the memory resources or the time spent rebuilding a graph in memory that, say, GraphX requires. This is not to mention the potential for better data integrity.
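
One way to picture a graph layered over a column family store is sketched below. The layout is an assumption for illustration, not Titan's actual storage format, and `ColumnFamilyGraph` is a made-up class. Each vertex gets one row, and its properties and edges become columns in that row, so a traversal step is a single row read.

```python
from collections import defaultdict

class ColumnFamilyGraph:
    """Toy graph layer: one row per vertex, one column per property or edge."""

    def __init__(self):
        # Stand-in for a column family table: row key -> {column name: value}.
        self.rows = defaultdict(dict)

    def add_vertex(self, vertex_id, **properties):
        for name, value in properties.items():
            self.rows[vertex_id][("prop", name)] = value

    def add_edge(self, source, label, target):
        # The edge is written into both endpoint rows so either side can follow it.
        self.rows[source][("out", label, target)] = True
        self.rows[target][("in", label, source)] = True

    def neighbors(self, vertex_id, label, direction="out"):
        # Only the vertex's own row is read to answer the traversal.
        return sorted(
            column[2]
            for column in self.rows[vertex_id]
            if column[0] == direction and column[1] == label
        )

g = ColumnFamilyGraph()
g.add_vertex("alice", kind="person")
g.add_vertex("bob", kind="person")
g.add_edge("alice", "knows", "bob")
g.add_edge("alice", "knows", "carol")

alice_knows = g.neighbors("alice", "knows")
known_by_bob = g.neighbors("bob", "knows", direction="in")
```

Since rows are the unit of distribution in such stores, the graph spreads across machines the same way any other table would.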

-- Andrew C. Oliver

Zeppelin

Whether you’re a developer who simply wants a pretty graph out of Hive or a data scientist who wants a notebook, Zeppelin might be for you. It uses the now-familiar notebook concept made popular by IPython, allowing you to write markup, embed code, execute code against Spark and other engines, and generate output in the form of text, tables, or charts. Zeppelin still lacks some of the features and multiuser functionality of Databricks’ product, but it is making steady progress. If you work with Spark, Zeppelin belongs in your toolkit.

-- Andrew C. Oliver