Since the birth of big data, Cloudera University has been teaching developers, administrators, analysts, and data scientists how to use big data technologies. We have taught over 50,000 people the details of using Apache technologies such as HDFS, MapReduce, Hive, Impala, Sqoop, Flume, Kafka, core Spark, Spark SQL, Spark Streaming, and Spark MLlib.
We have taught administrators how to plan, install, monitor, and troubleshoot clusters. We have shown analysts the power of SQL over large, diverse data sets. We have shown data scientists how to acquire, clean, transform, analyze, and make predictions at scale using core Spark, Spark MLlib machine learning pipelines, and the Cloudera Data Science Workbench. And we have taught developers the APIs of these technologies so they can create applications that were never before possible.
It’s great to know how to program these technologies. But when and how should you use them effectively? How do you map the different use cases and components of an application to these technologies? There are often several choices of technology and deployment model. What are the architectural tradeoffs? And since all of this is so new, how do you make sure that you are creating a sound application architecture?
Certainly by now there is a wealth of experience, including your own, from which we can all learn. To that end, Cloudera University has organized the Big Data Architecture Workshop. The workshop provides you an opportunity to learn not only from Cloudera but also from others applying big data technologies in a variety of domains, and to contribute your own knowledge and experience.
The workshop is centered around architecting and designing a big data application. It is an exciting and challenging application that combines the Internet of Things (IoT), streaming, near-real-time processing, big data analytics, and machine learning at scale. The workshop application is from a domain that everyone can understand. It is an application that would truly change the world.
The workshop format provides the best way to learn about appropriate architectures for challenging big data applications. It allows everyone to share their knowledge and experiences. While architectural principles are presented and discussed for applications in general, the workshop activities apply them to the workshop application.
We divide the workshop participants into teams, each typically with five members. Ideally there is a diversity of knowledge and experience on each team, and whenever possible we separate co-workers. From our experience, the debates that result from this diversity enrich everyone’s understanding of the architectural principles.
Over three days, we alternate presentations on general application architectural principles with team activities that apply them to the workshop application. Each team formulates, presents, and defends its proposals to the overall workshop.
Day One: Gain Initial Understanding
The first day is organized around gathering inputs to the architecture. The vision for the application is presented. Each team analyzes the application scenario. The team breaks it up into discrete use cases and components using object-oriented analysis and design (OOAD) techniques. This forms the beginning of a logical architecture.
We expect workshop participants to have a working knowledge of some, but not necessarily all, of these big data technologies, so on the first day we review them to fill in any knowledge gaps.
We stress the importance of building some of the use cases early in an agile fashion in order to reduce the risk of an unsound architecture. But which use cases should be constructed? We present a methodology for selecting a vertical slice of the application. Each team applies this methodology to the workshop application and defends their vertical slice to the workshop.
The first day of the workshop ends by focusing on the data requirements of a big data application. Ultimately, this section addresses whether an application requires the use of big data technologies. We discuss streaming data for near-real-time processing, stored data, and the lifecycle and growth of data sets. Each team analyzes the data requirements of the workshop application and presents their findings to the overall workshop.
Day Two: Big Data Architectural Principles
The second day of the workshop focuses on architectural principles of big data applications and applying them to the workshop application.
We first focus on application processing, including near-real-time processing of high-velocity, high-volume data streams, batch processing, stateful and stateless actions, stored data access patterns, message delivery and processing guarantees, and machine learning pipelines. Each team analyzes the workshop application and characterizes the processing done by the application.
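To make the stateful-versus-stateless distinction concrete, here is a toy sketch in plain Python (our own illustration, not workshop material) of a stateful, sliding-window count over an event stream. The event shape and function name are hypothetical; a real deployment would use a framework such as Spark Streaming, which manages this state with the delivery and processing guarantees discussed above.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def windowed_counts(events, window=timedelta(minutes=5)):
    """Stateful near-real-time sketch: count events per key over a
    sliding time window. Hypothetical event shape: (timestamp, key)."""
    state = deque()            # events still inside the window
    counts = defaultdict(int)  # current per-key counts
    snapshots = []
    for ts, key in events:
        state.append((ts, key))
        counts[key] += 1
        # evict events that have fallen out of the window (the stateful step)
        while state and state[0][0] < ts - window:
            _, old_key = state.popleft()
            counts[old_key] -= 1
        snapshots.append(dict(counts))
    return snapshots
```

A stateless transformation, by contrast, could process each event independently; it is the retained window state that complicates scaling and recovery.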
Next the workshop turns to scalable applications. We present design principles of scalable applications, determining if an application will scale, and Hadoop and Spark scalability. Each team identifies potential bottlenecks in the application and proposes how to address those, applying the scalable application design principles.
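As one concrete illustration of such a bottleneck (ours, not from the workshop materials), consider hash partitioning, the basic mechanism Hadoop and Spark use to spread work across workers. The helper below is hypothetical and deliberately minimal:

```python
import zlib

def partition_by_key(records, num_workers):
    """Hash-partition (key, value) records across workers so the same
    key always lands on the same worker. Uses crc32 for a hash that is
    stable across processes, unlike Python's salted built-in hash()."""
    shards = [[] for _ in range(num_workers)]
    for key, value in records:
        shards[zlib.crc32(key.encode()) % num_workers].append((key, value))
    return shards
```

If one key dominates the data, all of its records land on a single worker regardless of cluster size: exactly the kind of skew bottleneck the scalable application design principles help teams identify.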
The next topic of the workshop is fault tolerance. We present types of component failures, graceful recovery and performance degradation, tradeoffs between handling failures and making them transparent, achieving failure transparency via hardware versus system software, and stateless and stateful fault-tolerant services. Each team identifies the possible component failures in the workshop application and decides, for each failure, whether it should be handled by the application or made transparent.
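For failures an application chooses to handle itself rather than make transparent, a common pattern is retry with exponential backoff followed by graceful degradation. A minimal sketch in plain Python; the function name, parameters, and fallback behavior are our own illustration:

```python
import random
import time

def call_with_retries(operation, max_attempts=4, base_delay=0.1, fallback=None):
    """Handle a transient component failure in the application layer:
    retry with exponential backoff and jitter, then degrade gracefully
    to a fallback result instead of failing outright."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                break
            # back off exponentially, with jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    return fallback  # graceful degradation: serve a default rather than crash
```

The tradeoff the workshop discusses shows up directly here: the retries add latency and code complexity to the application, where a transparent approach would push the same failure handling down into hardware or system software.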
We wrap up the second day addressing security and privacy. We discuss threat analysis, privacy, and zero-trust networks. Each team identifies the security and privacy threats in the workshop application and proposes how to address them.
Day Three: Putting it All Together
By the third day, the workshop has addressed how to create a logical application model from a challenging big data application scenario, how to minimize architectural risk, and how to characterize data and processing requirements, scalability, fault tolerance, and security. The third day is all about creating a physical architecture and putting it all together into a big data application architecture.
First, we present deployment options for the application. We discuss on-premises deployment and sizing an on-premises cluster. We present the cloud options, both transient and persistent. We discuss the benefits and challenges of a hybrid deployment. Each team proposes how the workshop application should be deployed.
Next, we turn to selecting the technologies to use. We present a methodology and discuss the role of benchmarks and agile development. Each team selects the technologies to be used for the various use cases of the workshop application.
The last topic of the workshop is software architecture itself. We present various architectural artifacts and agile architecture and development processes for producing them.
The final team activity of the workshop is to produce an application architecture for the challenging workshop application. This includes a clear understanding of functional and non-functional requirements, a logical architecture, a physical architecture, and a methodology and plan for realizing the workshop application. The methodology defines an agile series of implementation tasks that minimize the risk of an unsound architecture.
At the end of the workshop, you will have experienced when and how to use big data technologies effectively. You will have mapped the different use cases and components of an application to these technologies, and you will have chosen technologies and deployment models with the architectural tradeoffs in mind.
All of the architectural artifacts for the workshop application that were produced by each team over the three days are made available to all of the workshop participants. The Big Data Architecture Workshop will have provided you an opportunity to learn from Cloudera as well as others applying big data technologies in a variety of domains.