Processing of large data -- Architecture, algorithm and visualization (turn)

Abstract: in this video tutorial in the series, Paul Dix to explore the practice of the big data. This tutorial focuses on the first 5 classes of mass XML data across multiple server system end-to-end processing, these data are from the real stock exchange announcement. In the last lesson Paul on the development of visual view, get insights from the macroscopic analysis of large data.

Paul Dix through a series of video tutorial "processing of large data: architecture, algorithm and visualization" is to explore the practice of the big data. This tutorial focuses on the first 5 classes of mass XML data across multiple server system end-to-end processing, these data are from the real stock exchange announcement. In the last lesson Paul on the development of visual view, get insights from the macroscopic analysis of large data. In this tutorial, Paul uses Ruby as the main development language, the use of Javascript in the browser. Moreover, Paul also created a GitHub project, so that everyone can see that he used in this tutorial to command. In this way, people can follow him on AWS cloud services to the last Ubuntu Linux instance to install all the software. Moreover, Paul also has developed all the procedures at the same instance.

Ready to write code

In the first five lesson tutorial, almost all of the code example is the use of Ruby language, this is mainly to from other technology transfer to large data architecture technology can have a smooth transition. Paul says, in some places in the tutorial should use other language instead of Ruby to obtain better performance. However, the use of a programming language in the tutorial can help readers attention big data structure rather than the low-level details. In the last lesson, Paul use Javascript, because it is a natural fit for the browser program.

The first three lesson tutorial describes how to install the commonly used server data architecture of software at. In the tutorial as server software examples are: Hadoop, Cassandra and Kafka. For the installation of each software, tutorials provided very detailed description, the reader can follow Paul a step by step instructions. These software supplement each other, and can run on a single server instance. Paul is using Ubuntu Amazon EC2 cloud service platform in 12.04 LTS m1.large instance. When necessary, additional software and code using the Paul in the tutorial, but these are not weaken the tutorial on the main server end systems concerned data architecture. Additional software with ZooKeeper, Redis and Sinatra etc. A simple software installation

Large data structure within reach

Tutorial can easily complete the selection of open source software in architecture, and the tutorial of data can also be downloaded at Internet. At the beginning of the tutorial focuses on using Hadoop for the treatment of non structured data. The stock exchange announcement data in 2011 September using the tutorial, stock transactions data reader can also use the latest 2013 March instead of. After Hadoop, tutorial introduction to Cassandra. Cassandra is a highly scalable, column oriented database, to realize efficient reading and writing in the large scale network in the database can be. Paul noted that Netflix and Twitter use the Cassandra. Then, Paul introduced to the message system with high throughput and high data structure handling data function. On the main server software on Paul, discusses the machine learning algorithm. This part of the tutorial contains a variety of computing k-nearest neighbors algorithm, and then the prediction analysis on stock transactions label. Tutorial 2 session will last for large data schema integration into the production environment, and the analysis on the macro level, using visualization techniques.

The use of Hadoop for the treatment of non structured data

Of course first of all clear, namely each data in the current issues related to talk about Hadoop. Paul recommends the Cloudera release, because it is relatively easy to install. Doug Cutting is the main architect of Cloudera, he co founded Hadoop the open source framework, this framework can be used as a service cluster running on multiple servers to handle the massive unstructured data. Google, MapReduce and Goolge file system (Google File System) provided the inspiration for Hadoop. The main component of Hadoop including the Hadoop kernel, Hadoop distributed file system (HDFS) and MapReduce. Pseudo distributed Hadoop use tutorial make start component services and real distributed Hadooplike. HDFS provides a cross server elastic data storage system; MapReduce technology provides the data location aware standardization process: read data, map the data (Map), the use of a key value rearrangement of data, then carries on the simplification to the data (Reduce) to get the final output. In the process of Map and Reduce, it can integrate, Hadoop and other software such as database, the message bus (message bus) and other file system. Moreover, extension system has contained a series of technology, these technologies include Sqoop, Flume, Hive, Pig, Mahout, Datafu and Hue etc.

Use Cassandra to store structured data

In the data structure, the main function of Cassandra is to store structured data. Paul in the tutorial DataStax version of Cassandra is recommended as a reference. Cassandra is a column oriented database, it provides high availability and durability service through a distributed architecture. It realizes the large scale clusters, one called "eventual consistency" consistency and provides, this means that at any given moment, on a different server in the same database entries can have differentvalue. Cassandra will eventually data converge to consistent state, but at the same time, from the system in different parts of the request to a database entry reads to differentvalue. Software engineers can through the adjustment of Cassandra, reduce the coherent risk to meet the expected use patterns. The interaction of Cassandra and Ruby implementation of Paul used in the tutorial in two ways, namely, the native gem client and the CQL client gem "Cassandra""cassandra-cql". Tutorial guide readers use Ruby for Map and Reduce, and interact with Cassandra. Paul explained a large data structure, the framework uses Hadoop for the treatment of non structured data, use the Cassandra processing structured data.

The use of Kafka communication

Paul said: "data and event data structure reasonable use message system for real-time processing of the input". In the tutorial, Paul is based on the open source software platform"Kafka"The message system plays in the data architecture in the role as an example. Kafka was originally developed by LinkedIn, now as a Apache project for the maintenance. The Kafka developers originally from a performance perspective the development of this system, if the trade-off between the functional and performance, they will choose Properties. Kafka can be installed on a single computer, but also maintains the producer message system even in the single (Producer), agent (Broker) and consumers (Consumer) Division. Kafka dependent ZooKeeper determine the relationship between the upstream and downstream between agents. Through Ruby gem "Kafka-rb" can get a KafkaAPI, the API can easily be producer / consumer call. Paul to ensure understanding should be made by the message consumers responsible for processing the received message. Moreover, he also explains in detail the use of Redis as an external system algorithm, this algorithm can prevent multiple consumers deal with the same message.

The learning algorithm to predict the result of using the machine

Large data structure to extract useful information from data, and use the model to predict the classification of these data. Paul refers to a machine learning algorithm"k-nearest neighbor(k-NN)", Notice to the forecast and stock trading label. The lecture also covers two related to distance algorithm Ruby implementation, i.e.The Euclidean algorithmAndCosin like algorithm;. Paul points out that, although the Ruby language can be used to explore large data structure, but in the real system should use a more efficient language -- such as the C language. This section contains a lot of practical coding content, including the creation of vocabulary, distance algorithm mentioned before, how will publish the data into a sparse matrix algorithm and complete a prediction bulletin label.

Testing the model in a production environment

Paul in the tutorial through using the "notice" as an example, explained."Cross validation(Cross-Validation)"The meaning and its Ruby implementation. He used the before part of the code, and its integration into the development of a HTTP server, which is called Sinatra Ruby gem. The service has become a tag recommendation system. For Paul speaking, provides concepts and basic ideas for the above code. Paul also discussed how to use multiple data sets to construct a tag recommendation system, in which each data set construction system is a new version. The server can provide access interface of different recommendation system in different versions, but the user knows only the tag recommendation service, and not aware of the existence of multiple recommendation system. In this way, can be real-time switching to a different version of the recommendation system. The site can be different versions of tag recommendation tag data terminal user system are summarized, and the output to the message bus (Message Bus) in the out of band (Out of Band) processing. The message bus collected information is stored in Hadoop, and then use that announcement data database for cross validation to judge the model accuracy. Finally, the model predicted results will be feedback to a new round of model.

Visible in the browser

In the last lesson, Paul talked about how to get high data architecture in which the data results are visualized, and transforming it into human understandable information. Course through the necessary JavaScript explain in detail the tag data show how the time series in the browser. Paul uses the D3 Javascript library to create a histogram and time sequence diagram. Paul based on the D3 library code written by bulletin data "drive" (translator's note: D3 Data Driven Document, that is data driven document). He mentioned the function following D3 Javascript Library: value domain (extent), data mapping (scale), select the HTML element, and a data retrieval.

InfoQ this series of video tutorials for an interview with Paul Dix:

InfoQ: In large data item, which is the biggest challenge facing the company?

Paul: I think there are two of the largest challenges, one is the actual operation, another is learning to program against the new tool. In the case of data are distributed systems, so the actual operation is very difficult. For many companies, this is the first attempt to run the distributed cluster. The distributed cluster is not simply a set of Web server, is not a typical RDBMS configuration of the main, from the server system. Programming time for large data system, everyone is always looking for the MapReduce paradigm for batch processing, or a message system and real-time stream processing model. The model as a standard Web request, the database cache and life-cycle differences.

InfoQ: Release of large data structure in the development and product, how will the "continuous delivery" as the practice of a software development in the integration?

Paul: In a large data environment for sustained delivery is very difficult, at least the price very expensive. Because of the large data, a lot of processing in the small scale data are running very well, but on the whole data set altogether. The only way to ensure the continued delivery of the normal operation of the system, is to check

The right of every thing, you need to have a complete mapping of the system data in the production environment. This will need to run at the same time, two sets of clusters, costly. A compromise approach is to take samples of the production environment data, image data so you don't have to save the production environment. However, the instant that must also ensure sampling enough data, ensure that can solve the performance problem in deploying to a production environment.

InfoQ: In the release to production, assembly of large data architecture should be automatic quality monitoring how?

Paul: "Test driven development to all component (TDD)" is obviously a good practice. However, there are some other points can not be achieved under TDD. The first point is to ensure that the release to production environment before the test performance of large scale data. Another point in only create the prediction model for classification or clustering, when used, is the need to understand the production environment data, the data so that you can constantly iterative model, and compare the performance between the version model.

Author brief introduction video tutorial:

Paul Dix is Errplane CEO and co-founder. Errplane is supported by American start famous entrepreneurial incubator Y Combinator, mainly to provide for detection and monitoring of applications and infrastructure services, and can automatically according to the detected abnormal send alarm information. Paul is the "Service Oriented Design with Ruby and Rails" the author of one book. He often in the conference and user group (including Bacon, Web 2, RubyConf, RailsConf and GoRuCo) of speech. Paul is a NYC machine learning discussion group (NYC Machine Learning Meetup) founder and organizer, has more than 3600 members of the discussion group. Paul served in the Venture Company and Google, Microsoft and McAfee such big company. He currently resides in New York.

View original link: Interview and Video Review: Working with Big Data: Infrastructure, Algorithms, and Visualizations

Posted by Kay at December 15, 2013 - 3:58 AM