Decoupled

A selective journal on distributed systems

What is Apache Bigtop?

nosql:

The project founder, Roman Shaposhnik defining what is Apache Bigtop:

The elevator pitch for Bigtop has always been: Bigtop is to Hadoop what Debian is to Linux. The most surprising development to me was how well that message resonates with the commercial vendors in the Big Data space. I’m still amazed at how quickly the “Powered by Bigtop” list is growing.

Original title and link: What is Apache Bigtop? (NoSQL database©myNoSQL)

Using Redis as an External Index for Surfacing Interesting Content at Heyzap

nosql:

Micah Fivecoate introduces a series of algorithms used at Heyzap for surfacing interesting content:

  1. currently popular
  2. hot stream
  3. drip stream
  4. friends stream

All of them are implemented using Redis ZSETs:

In all my examples, I’m using Redis as an external index. You could add a column and an index to your posts table, but it’s probably huge, which presents its own limitations. Additionally, since we only care about the most popular items, we can save memory by only indexing the top few thousand items.

Original title and link: Using Redis as an External Index for Surfacing Interesting Content at Heyzap (NoSQL database©myNoSQL)

Now All Reads Come From Redis at YouPorn

nosql:

Speaking of Redis as the primary data store, this post from Andrea reminded me of YouPorn usage of Redis:

Datastore is the most interesting part. Initially they used MySQL but more than 200 million of pageviews and 300K query per second are too much to be handled using only MySQL. First try was to add ActiveMQ to enqueue writes but a separate Java infrastructure is too expensive to be maintained Finally they add Redis in front of MySQL and use it as main datastore.

Now all reads come from Redis. MySQL is used to allow the building new sorted sets as requirements change and it’s highly normalized because it’s not used directly for the site. After the switchover additional Redis nodes were added, not because Redis was overworked, but because the network cards couldn’t keep up with Redis. Lists are stored in a sorted set and MySQL is used as source to rebuild them when needed. Pipelining allows Redis to be faster and Append-only-file (AOF) is an efficient strategy to easily backup data.

Original title and link: Now All Reads Come From Redis at YouPorn (NoSQL database©myNoSQL)

Bitly Forget Table - Building Categorical Distributions in Redis

nosql:

In the comment thread of the post “Using Redis as an external index for surfacing interesting content“, Micha Gorelick pointed to a post covering a similar solution used at Bitly:

We store the categorical distribution as a set of event counts, along with a ‘normalising constant’ which is simply the number of all the events we’ve stored. […]

All this lives in a Redis sorted set where the key describes the variable which, in this case, would simply be bitly_country and the value would be a categorical distribution. Each element in the set would be a country and the score of each element would be the number of clicks from that country. We store a separate element in the set (traditionally called z) that records the total number of clicks stored in the set. When we want to report the categorical distribution, we extract the whole sorted set, divide each count by z, and report the result.

Storing the categorical distribution in this way allows us to make very rapid writes (simply increment the score of two elements of the sorted set) and means we can store millions of categorical distributions in memory. Storing a large number of these is important, as we’d often like to know the normal behavior of a particular key phrase, or the normal behavior of a topic, or a bundle, and so on.

The Bitly team has open sources their solution named Forget Table and the code is available on GitHub.

Original title and link: Bitly Forget Table - Building Categorical Distributions in Redis (NoSQL database©myNoSQL)

NetflixGraph: In-Memory Directed Graph Data

nosql:

Another open source project from Netflix: NetflixGraph:

NetflixGraph is a compact in-memory data structure used to represent directed graph data. You can use NetflixGraph to vastly reduce the size of your application’s memory footprint, potentially by an order of magnitude or more. If your application is I/O bound, you may be able to remove that bottleneck by holding your entire dataset in RAM. You’ll likely be very surprised by how little memory is actually required to represent your data.

At first glance it sounds sort of a Redis for graph data. Available on GitHub.

Original title and link: NetflixGraph: In-Memory Directed Graph Data (NoSQL database©myNoSQL)

Hadoop and Splunk Use Cases

nosql:

Good post from Splunk about the use cases where Hadoop and Splunk coexist and cooperate:

The Splunk and Hadoop communities can benefit from each other’s strengths. Below are several examples of customers that use both environments.

  1. Splunk then Hadoop
    • Splunk: collects, visualizes and analyzes the data
    • Hadoop: ETL and other batch processing
  2. Hadoop then Splunk
    • Hadoop: collects the data
    • Splunk: visualization
  3. Bi-directional: Splunk and Hadoop collect different artifacts and share the data that Hadoop needs for ETL or batch analytics and Splunk needs for real-time analysis and visualization
  4. Splunk monitors Hadoop

Original title and link: Hadoop and Splunk Use Cases (NoSQL database©myNoSQL)