Data is becoming a cornerstone of software services. Whether it is the business model, a driver of revenue, or both, tech companies are flocking to use this “free” resource to provide better services and outperform their competitors.
If you hold the “new sexy” job in computer science or you’re doing research in this field, you will find the resources in this article as helpful as I did. Frontier companies such as Google, Facebook, LinkedIn, and Twitter, as well as major universities, have published dozens of papers and articles describing internal projects they worked on. Many of these projects were later released as open source and became staples of the field. To save you the time and pain of getting lost in the labyrinth of endless resources on the internet (the way I did), I compiled a categorized list here for your pleasure. I will try to update the list frequently to keep it up to date.
I divided the resources into several main categories:
Bigtable: Google’s petabyte-scale NoSQL database, offered on Google Cloud as Cloud Bigtable.
Cassandra: the wide-column database originally developed at Facebook.
Voldemort: Distributed database by LinkedIn.
Dynamo: Amazon’s key-value store.
HBase: Column-oriented storage over HDFS, modeled on Bigtable and used heavily at Facebook.
Neo4j: the famous graph database.
Snowflake: A data warehouse for the cloud.
The Google File System: the distributed file system for big data whose design underlies distributed storage in Hadoop.
HDFS: The Hadoop Distributed File System.
RCFile: Data placement for data warehouses used in Apache Hive.
Parquet: columnar storage format.
Haystack: an object storage system optimized for Facebook’s Photos application.
Windows Azure Storage: Cloud Storage System from Microsoft.
Data management in cloud environments - NoSQL and NewSQL data stores: A paper surveying data stores beyond SQL, such as Redis and HBase.
Recommending items to more than a billion people: An article about collaborative filtering at Facebook.
Machine Learning with Sparkling Water: Using H2O, the machine learning framework, with Apache Spark.
MLlib: Scalable Machine Learning library on Apache Spark from Stanford/Databricks.
TensorFlow: the famous large-scale machine learning library.
Large-scale parallel collaborative filtering for the Netflix prize: an algorithm for large-scale recommendation of Netflix movies.
Airflow: a workflow management system by Airbnb.
Oozie: a workflow management system for Hadoop by Yahoo!.
BlinkDB: approximate query engine for analytics on large-scale data, from Berkeley.
FlumeJava: a library for developing parallel data pipelines from Google.
MapReduce: the Google programming model that inspired Hadoop.
The Dataflow Model: the model behind Google Cloud Dataflow, which provides simplified stream and batch processing.
MillWheel: stream processing engine from Google.
Photon: A tool to join data streams at Google.
Kinesis: stream processing engine from Amazon.
Trill: incremental data analytics engine from Microsoft.
Kafka: the famous distributed messaging system from LinkedIn.
Apache Spark: the famous stream and batch processing engine. It uses distributed memory abstractions: RDDs, DataFrames, and Datasets. Spark 2 introduced Structured Streaming, and the Spark SQL library allows SQL queries over Spark DataFrames. The whole Databricks blog is a great resource for the project.
SparkR: a Spark library for writing processing applications in R.
GraphFrames: distributed graph processing with Spark’s DataFrames.
Storm: real-time data processing engine from Twitter.
Heron: Twitter’s successor to Storm.
Pulsar: real-time data processing engine from eBay.
WTF: the “Who to Follow” service at Twitter.
GraphJet: real-time recommendation graph engine at Twitter.
Pregel: large-scale graph processing engine at Google.
Giraph: open source implementation of Pregel, run at very large scale by Facebook.
Dremel: analytics system by Google.
Impala: SQL engine for Hadoop by Cloudera.
Drill: An open source implementation of Dremel.
Dryad: a framework to define dataflow graphs from Microsoft.
Tez: an open source dataflow framework for Hadoop in the spirit of Dryad, from Hortonworks and Microsoft.
Kudu: storage for fast analytics on big data, by Cloudera.
Google: how the large-scale search engine was built.
The Lambda Architecture: an architecture for data pipelines.
The Kappa Architecture: an alternative architecture for data pipelines.
Summingbird: a framework for integrating batch and online computations.
Eventual Consistency: A look at how data consistency works in NoSQL database systems.
Paxos: a consensus algorithm for distributed systems.
Raft: an alternative consensus algorithm to Paxos.
ZooKeeper: coordination and distributed configuration service by Yahoo!.
YARN: resource manager for Hadoop.
Borg: Cluster manager by Google.
Kubernetes: container-orchestration system for automating application deployment, scaling, and management by Google.