80+ Free Big Data Resources to Satisfy Your Knowledge Appetite

Whether you want to know how things work under the hood or you keep scratching your head when things get complicated

Posted by Tariq Abughofa on September 21, 2019 · 7 mins read

Data is becoming a cornerstone of software services. Whether it underpins the business model, drives revenue, or both, tech companies are flocking to this “free” resource to provide better services and outdo their competitors.

If you are in the “new-sexy” position in computer science or you’re doing research in this field, you will find the resources in this article as helpful as I did. Leading companies in this field like Google, Facebook, LinkedIn, and Twitter, as well as big universities, have released dozens of papers and articles on the subject outlining internal projects they worked on. Many of these projects were later released as open source and became staples in the field. To save you the time and pain of getting lost in the labyrinth of endless resources on the internet (the way I did), I compiled a categorized list here for your pleasure. I will try to update the list frequently to keep it up to date.

I divided the resources into several main categories:

Big Data Storage & NoSQL

Bigtable: the petabyte-scale NoSQL database behind many Google services.

Cassandra: the wide-column database originally developed at Facebook.

Voldemort: Distributed database by LinkedIn.

Dynamo: Amazon’s key-value store.

HBase: column-oriented storage over HDFS, used heavily at Facebook.

Neo4j: the famous graph database.

Snowflake: A data warehouse for the cloud.

The Google File System: Google's distributed file system for big data and the design that inspired distributed storage in Hadoop.

HDFS: The Hadoop Distributed File System.

RCFile: Data placement for data warehouses used in Apache Hive.

Parquet: columnar storage format.

Haystack: an object storage system optimized for Facebook’s Photos application.

Windows Azure Storage: Cloud Storage System from Microsoft.

Data management in cloud environments - NoSQL and NewSQL data stores: a paper surveying data stores beyond SQL such as Redis, HBase, etc.

Machine Learning and Algorithms in Big Data

Recommending items to more than a billion people: An article about collaborative filtering at Facebook.

Machine Learning with Sparkling Water: using H2O, the machine learning framework, with Apache Spark.

MLlib: Scalable Machine Learning library on Apache Spark from Stanford/Databricks.

TensorFlow: the famous large-scale machine learning library.

Large-scale parallel collaborative filtering for the Netflix prize: an algorithm for large-scale recommendation of Netflix movies.

Data Processing Systems

Airflow: a workflow management system by Airbnb.

Oozie: a workflow management system for Hadoop by Yahoo!.

BlinkDB: approximate analytics on large-scale data from Berkeley.

FlumeJava: a library for developing parallel data pipelines from Google.

MapReduce: the Google programming model that inspired Hadoop.

Pig: an engine that supports Pig Latin, a procedural dataflow language for Hadoop, from Yahoo!.

Hive (1): A data warehouse on top of Hadoop.

The Dataflow Model: the model behind Google Cloud Dataflow which provides simplified stream and batch processing.

MillWheel: stream processing engine from Google.

Photon: A tool to join data streams at Google.

Kinesis: stream processing engine from Amazon.

Apache Flink (2): stream and batch processing engine from TU Berlin.

Trill: incremental data analytics engine from Microsoft.

Kafka: the famous distributed messaging system from LinkedIn.

Apache Spark: the famous stream and batch processing engine. It uses distributed memory abstractions: RDDs, DataFrames, and Datasets. Since Spark 2, it has moved to Structured Streaming (2) (3) (4), and the Spark SQL library allows SQL queries over Spark DataFrames. The whole Databricks blog is a great resource for the project.

SparkR: a Spark library for writing processing applications in R.

GraphX (2): distributed graph processing with Spark’s RDDs.

GraphFrames: distributed graph processing with Spark’s DataFrames.

SnappyData (2): a transactional datastore on top of Spark.

Real-time Processing

Samza (2) (3) (4): Stream processing engine from LinkedIn.

Storm: real-time data processing engine from Twitter.

Heron: the new Storm from Twitter.

Realtime data processing at Facebook.

Pulsar: real-time data processing engine from eBay.

Graph Processing

WTF: the "Who to Follow" service at Twitter.

GraphJet: real-time recommendation graph engine at Twitter.

Pregel: large-scale graph processing engine at Google.

Giraph: open source implementation of Pregel, used and scaled at Facebook.

Interactive Analytics

Dremel: analytics system by Google.

Impala: SQL engine for Hadoop by Cloudera.

Drill: An open source implementation of Dremel.

Dryad: a framework to define dataflow graphs from Microsoft.

Tez: an open source implementation of Dryad from Hortonworks and Microsoft.

Kudu: a storage engine for fast analytics on big data by Cloudera.

Big Data Challenges and Ecosystems

Google: how the large-scale search engine was built.

The CAP theorem (1): the theorem that paved the way for NoSQL databases.

The Lambda Architecture: an architecture for data pipelines.

The Kappa Architecture: an alternative architecture for data pipelines.

Summingbird: a framework for integrating batch and online computations.

The Log Problem in Big Data.

Eventual Consistency: A look at how data consistency works in NoSQL database systems.

The Big Data Ecosystem at LinkedIn.

Resource Management

Paxos: a consensus algorithm for distributed systems.

Raft: an alternative consensus algorithm to Paxos.

Zab: the consensus algorithm used in ZooKeeper. Here is a comparison between Zab and Paxos.

ZooKeeper: coordinator and distributed configuration system by Yahoo!.

YARN: resource manager for Hadoop.

Borg: Cluster manager by Google.

Kubernetes: container orchestration system by Google for automating application deployment, scaling, and management.

