Big Data – The technologies to achieve Smart Data
November 09, 2015

The sheer mass of information we create every day is enough to make your head spin: with over 2.5 quintillion bytes of new data, that is 2.5 billion billion bytes, we gather each day as much information as the human mind was able to produce since its origins.

From the sensors embedded in our immediate environment to our exchanges on social networks, through our purchasing habits and online transactions, we enrich this voluminous common encyclopedia every second. Big Data is the expression coined for this phenomenon, denoting the immense mass of data collected every day by businesses worldwide. But collection itself is only the sunny side of the mountain to climb. The shady side is the proper analysis of this data, through powerful, modern tools able to root out its true meaning and reorient decisions toward more pragmatic choices, in better agreement with customer demands. All business sectors are affected by the phenomenon, from science and medicine, fields that have traditionally built up phenomenal databases, to smaller SMEs operating in a niche market, for whom Big Data is the fuel their business needs. To fix ideas, remember that Facebook rakes in 500 terabytes of data every day, and Twitter 80 MB… per second! Unstructured, voluble, spread across multiple formats (text, image, video, audio…), timestamped, geolocated and inherently "noisy", this information must be analyzed to extract its true meaning. Epidemiologists already use Twitter to map the evolution of flu or gastroenteritis outbreaks by correlating keywords with the geolocation of tweets. Analyzing this information allows predictive uses: like Twitter's famous "trends", companies can identify all kinds of movements and orientations in the collected data. But the Big Data revolution is not confined to purely external, public sources such as social networks. Companies are also invited to peruse their own data sources, through channels of their own. Let us discover how this set of technologies can provide a substantial competitive advantage.

The Internet and IP networks have mutated into huge pathways of infinite channels, carrying unimaginable amounts of data at any time of day or night. The figures speak for themselves: every minute, 639,800 gigabytes of data are exchanged on the Internet, including 204 million e-mails, two million search queries and more than 600,000 electronic transactions, according to an IBM study. A prodigious source of information on the habits and expectations of consumers, but also on fundamental changes in human activities! By fully entering the digital age, businesses and society as a whole produce and brew ever larger volumes of data. In plain terms, information is to the digital economy what coal was to the industrial economy: its main fuel and a phenomenal growth driver for all companies. A recent EMC study indicates that 74% of French companies believe that Big Data facilitates decision-making; 47% find that this set of technologies allows market leaders to rise, and 23% believe it creates competitive advantages. But they must still be able to collect quality data and give it meaning through analysis and fine interpretation in light of business objectives! To some extent, companies have been accustomed for decades to using feedback and the performance of their traditional distribution channels to drive their business, but the corresponding data was confined to internal databases, embedded in frozen applications on their own information systems. With the rise of the Web, the sources are gaining in scope and diversity. The challenge of Big Data is precisely to give them meaning so as to optimize decision-making, react almost in real time and enjoy a much more attractive return on investment.

Typical Big Data platforms
Analytical Databases               42.10%
Operational Data Warehouses        39.40%
Cloud-based Data Solutions         39.00%
On-premise Data Hosting Solutions  33.60%
Datamart                           30.10%
NoSQL Platforms                    21.60%
Hadoop & Subprojects               16.20%
Other                               0.40%

Decision-making data
Big Data allows greater responsiveness and greater flexibility. In a network of ready-to-wear shops, for example, sales data may reveal that a majority of customers buying a certain dress pair it with a specific shoe model. By bringing the two items together on the shelves, the shops maximize the effectiveness of cross-selling and increase profits accordingly. In a traditional sales organization, such a decision could be taken only after tedious manual analysis of receipts, or on the intuition of a local store manager. By then, the dress's season is likely to be almost over before the right inventory decisions are made! Using Big Data, it is conceivable to inject all sales data instantly into a heterogeneous database, which may include customer tweets ("how is my brand talked about on social networks?"), their Instagram pictures ("how are my products worn or used every day?") and the reactions of their friends on Facebook ("how popular are my products within the community?"). By crossing requests and effectively questioning this mass of information, guided by matured intuitions and experience, we obtain the major trends in an instant, in order to take decisions far more useful to the growth of the company. What seems obvious in the field of marketing and trade applies with equal efficiency in all sectors. Human resources, health, the automotive and aviation industries, utilities, the leisure industry, connected objects, media and even internal processes: all these sectors increase their efficiency and productivity through Big Data. Or, more precisely, Big Data enlightens their operation and contributes to optimizing their performance.
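The dress-and-shoes example above boils down to a simple co-occurrence count over receipts. Here is a minimal sketch in Python; the receipts and item names are invented for illustration, and a real deployment would run such a count over millions of transactions on a cluster:

```python
from collections import Counter
from itertools import combinations

# Hypothetical receipts: each one lists the items bought together.
receipts = [
    ["blue dress", "black pumps", "scarf"],
    ["blue dress", "black pumps"],
    ["jeans", "sneakers"],
    ["blue dress", "black pumps", "handbag"],
]

# Count how often each pair of items appears on the same receipt.
pair_counts = Counter()
for items in receipts:
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1

# The most frequent pair is a natural cross-selling candidate.
best_pair, count = pair_counts.most_common(1)[0]
print(best_pair, count)  # ('black pumps', 'blue dress') 3
```

The same logic, applied to tweets or Instagram tags instead of receipts, yields the "major trends" mentioned above.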

Why Launch A Big Data Project / Use Case By Industry Sector

Sector          | Treatment Speed | Combine Hybrid Data | Anticipate Data Treatment | Use Diffused Data | Structure Data | Online Archiving
Finance         | 28.0 %          | 15.9 %              | 18.3 %                    | 11.0 %            | 14.6 %         | 12.2 %
Sales           | 28.3 %          | 21.7 %              | 15.2 %                    | 10.9 %            | 15.2 %         | 8.7 %
Industry        | 22.4 %          | 20.4 %              | 16.3 %                    | 16.3 %            | 12.2 %         | 12.2 %
Public Services | 21.6 %          | 17.5 %              | 17.5 %                    | 13.4 %            | 12.4 %         | 17.5 %
DevOps          | 22.0 %          | 13.6 %              | 13.6 %                    | 13.6 %            | 20.3 %         | 16.9 %
Health          | 20.8 %          | 22.9 %              | 12.5 %                    | 6.3 %             | 16.7 %         | 20.8 %
Construction    | 20.3 %          | 21.5 %              | 15.2 %                    | 20.3 %            | 10.1 %         | 12.7 %

The three dimensions of Big Data
Specifically, Big Data covers three dimensions that the company must master in order to effectively drive its decisions: volume, velocity and variety, the so-called "rule of the 3 Vs". Volume refers first to the huge amounts of data already mentioned, now within the reach of companies. Some of this data, like the tweets and publications exchanged on social networks, is fundamentally new. But much of it comes from sensors and conventional collection tools, such as the annual readings of electricity meters. In that specific area, it is estimated that analyzing such a volume of data helps identify, and even anticipate, incidents on the distribution network faster, and helps orchestrate more rational energy consumption. In all cases, it is no longer uncommon to process volumes exceeding tens of terabytes, or even petabytes (1,000 terabytes). For comparison, it is estimated that more than two zettabytes (two million petabytes) are created and exchanged on the Internet annually. Velocity refers to the speed of analysis and decision-making. For time-sensitive processes, such as anomaly and fraud detection, but also for decisions with an almost immediate impact on sales (like our ready-to-wear example), companies must be able to analyze data in real time. Finally, variety corresponds to the plurality of the information collected. Unlike conventional information processing, which clearly defines its field and treats similar data with similar equipment, data from the Web is unstructured in nature and can include text as well as sensor data, audio, video, location-based information and activity logs. By combining these three dimensions, Big Data not only allows new treatments, through the analysis of data that was previously beyond the reach of enterprises, but also ensures greater decision-making responsiveness.
Where it sometimes took several days or even weeks of analysis to make sense of the information collected, Big Data delivers responses in the order of a minute.

Obstacles / By Industry Sector

Sector          | Shareholders Strategy | Bad Data Management | No Hadoop Specialists | No MongoDB Specialists | Complexity of Deployment | No Apps Management | Other
Health          | 32.1 %                | 22.6 %              | 15.1 %                | 11.3 %                 | 5.7 %                    | 11.3 %             | 1.9 %
DevOps          | 29.5 %                | 20.5 %              | 15.4 %                | 16.7 %                 | 12.8 %                   | 5.1 %              | 0.0 %
Finance         | 28.3 %                | 22.8 %              | 16.3 %                | 15.2 %                 | 12.0 %                   | 5.4 %              | 0.0 %
Industry        | 27.5 %                | 20.0 %              | 22.5 %                | 12.5 %                 | 10.0 %                   | 7.5 %              | 0.0 %
Construction    | 25.6 %                | 25.6 %              | 18.2 %                | 9.8 %                  | 11.0 %                   | 9.8 %              | 0.0 %
Sales           | 25.5 %                | 23.6 %              | 20.0 %                | 14.5 %                 | 9.1 %                    | 7.3 %              | 0.0 %
Public Services | 18.9 %                | 24.2 %              | 21.1 %                | 12.6 %                 | 14.7 %                   | 7.4 %              | 1.1 %

The end of a myth
However, Big Data technologies are by no means a miracle solution, and it is not enough to store a greater volume of data than the competition to spontaneously enjoy a substantial advantage. The first concern for companies taking the plunge is, of course, ROI. Storing a large volume of data for processing, not to mention the operations needed to interpret it, carries a serious cost that must pay for itself. For example, OVH's Big Data offering of dedicated servers includes a cluster with 48 TB of storage for 1,000 euros per month. On the Amazon Web Services side, the offering is more fragmented, but its integrated Amazon Redshift service costs around 1,000 dollars per terabyte per year. According to the EMC study, 60% of companies confirm that budget is the first factor in the decision to engage in Big Data, and almost 41% delay their adoption of this new wave of tools, citing the lack of visibility into the return on investment. Another difficulty, harder to quantify: the ethical questions around the use of the collected information, and the specificities of local regulations on data protection.

Goals of Big Data Projects / By Industry Sector

Sector          | Operational Analysis | Operational Treatments | Social Branding / Perception Analysis | Relational and Behavioral Analysis
Industry        | 58.2 %               | 21.8 %                 | 9.1 %                                 | 10.9 %
Public Services | 51.1 %               | 12.5 %                 | 21.6 %                                | 14.8 %
DevOps          | 50.7 %               | 20.0 %                 | 12.0 %                                | 17.3 %
Finance         | 47.8 %               | 16.3 %                 | 21.7 %                                | 14.2 %
Health          | 47.3 %               | 21.8 %                 | 7.3 %                                 | 23.6 %
Sales           | 47.1 %               | 25.5 %                 | 15.7 %                                | 11.7 %
Construction    | 32.4 %               | 35.3 %                 | 13.2 %                                | 19.1 %

Issues and technologies
"We are now immersed in a digital ocean, which includes both the data produced by traditional computers and, increasingly, the so-called 'digital noise', that is to say all the data generated by our devices, such as a smartphone and any kind of geolocation trace, but also the data corresponding to our actions on the Web, our presence on social networks, connected objects, and so on. With the phenomenon of Big Data, we have the means to record, capture, store and analyze everything," said Bernard Ourghanlian, Chief Technology and Security Officer at Microsoft France, during Microsoft's Techdays 2015. If companies are beginning to grasp the value of this information and are trying to use it to help steer their decisions, they must nevertheless keep in mind the rule of the 3 Vs mentioned previously, particularly the huge variety of data they are likely to peruse. Faced with such heterogeneous information, which can include numbers of clicks on a Web campaign as well as movies or business newspapers, traditional relational databases struggle to categorize the information.

Hadoop, the technological answer for large data volumes
The technological response was born largely within Web giants like Google. While its search engine was still in its infancy and faced competition from Altavista, Yahoo, Lycos and Hotbot, the Mountain View company developed a series of technologies to store, process and index close to five billion Web pages. In the early 2000s, it developed MapReduce, Google BigTable (a compressed DBMS) and Google File System (a distributed file system), the three cornerstones of the algorithmic system behind its search results. Doug Cutting, the developer of Lucene, the free Java search engine distributed by the Apache Foundation, drew on these projects and created the first prototype of Hadoop. Developed in Java, Hadoop is an open source framework designed to handle massive volumes of data, in the order of several petabytes. In the manner of the Google projects, it relies on distributed file management to quickly process a permanent flow of information. So as not to lose the battle of the Web, Yahoo! took a keen interest in this solution and became its main technical and financial contributor, hiring Doug Cutting and rebasing its own search engine on this technological brick. The operating principle of Hadoop is relatively simple. It revolves around the concept of "grid computing": distributing the execution of a treatment over several clusters of servers. In the manner of Google File System, it introduces its own file system, HDFS (Hadoop Distributed File System), which distributes data storage in the form of "blocks" over different nodes, while replicating them in order to preserve unaltered copies. The distribution and management of computations are performed through MapReduce. As its name suggests, this technology combines two functions: "Map", which breaks a task into smaller subsets that lead to as many parts of the final result, and "Reduce", which consolidates the final result from the subsets obtained.
Parallel processing saves considerable time, whereas traditional databases often process data in single batches. Through its modular architecture, Hadoop offers four features essential to Big Data. First, it solves the problem of storage cost: to store more information, one simply adds nodes (as virtual machines, for example) instead of renewing the company's storage arrays, a very expensive proposition that is difficult to anticipate. Hadoop is also scalable, that is to say, it is easy to grow the processing solution as activity increases. Furthermore, through its distributed file system, Hadoop allows bulk storage of heterogeneous data. The data does not have to be structured, unlike in traditional relational databases, and there is no need to predict its use. Hadoop also ensures higher safety, through its redundancy and data replication system, and high performance through parallel processing on a cluster's nodes.
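The Map/Reduce principle can be illustrated with the classic word-count example. The following is a deliberately simplified, single-process sketch in Python (Hadoop itself exposes this model through a Java API, and would distribute the work over cluster nodes); the function names are ours:

```python
from collections import defaultdict

def map_phase(document):
    # "Map": break the task into (key, value) pairs,
    # here one ("word", 1) pair per word, as a word-count mapper would.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # "Reduce": group the intermediate pairs by key, then consolidate
    # each group into its part of the final result.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

documents = ["big data big deal", "big volume"]
intermediate = []
for doc in documents:  # on a real cluster, each document is mapped on its own node
    intermediate.extend(map_phase(doc))

counts = reduce_phase(intermediate)
print(counts)  # {'big': 3, 'data': 1, 'deal': 1, 'volume': 1}
```

The point of the model is that both phases parallelize naturally: every document can be mapped independently, and every key can be reduced independently.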

The boom of NoSQL solutions
Like Google BigTable, Hadoop embeds a distributed database management system, HBase, which has also served as a base for Facebook since 2010. It is part of the "NoSQL" movement (for "Not Only SQL"), a category of DBMS that differ from traditional relational databases in that their basic logic is no longer the table and the matrix representation of information, but the concept of a binary document, and in that they are not necessarily queried with SQL. Developed since 2007 by 10gen, MongoDB is one of the most famous such DBMS and follows the same principle as HBase. It takes its name from the English "humongous", meaning "huge", and it can be distributed over any number of nodes, which are added or removed at will. Objects are stored in BSON (binary JSON) without a predetermined schema: it is thus possible to add new keys at any time, without reconfiguring the base. Specifically, the data corresponds to "documents" stored in "collections": the latter are similar to the tables of relational databases, and documents to their records. Within the same collection, documents do not necessarily have to obey the same structure or present the same fields. As with classic JSON, documents consist of a series of key/value pairs, and one can query them with well-proven techniques like JavaScript and associative arrays. Here is an example of a typical document in a MongoDB (and more generally NoSQL) collection:

"_id": ObjectId("2fa8c5db87c9"),
"Purchase":"Blue Dress"
"Street":"12, Park Avenue",
"City":"New York",
"Zip Code":"12345"

Here, the keys (field names) and values (which systematically follow the colon) need not be preserved from one document to the next. It is even possible to nest a document within a key, as with the "Address" field in the above example. By querying the DBMS with a classic Web language like JavaScript, you enjoy greater processing flexibility. Furthermore, skills in this type of language are widespread, and it becomes possible to "talk to" the data without a heavy recruitment process for experts, who are still too rare on the market. The current OVH and Amazon offerings for Big Data storage and processing also revolve around the Hadoop/MongoDB couple. These technologies are fertile ground for start-ups, which offer complete, integrated distributions. Among the major players at the international level are Hortonworks, a subsidiary of Yahoo! supported by Microsoft, which integrates it directly with Windows Server and Windows Azure; Cloudera, which has received $740 million in funding from Intel; and MapR, which is based on a native Unix file system instead of HDFS and reintroduces SQL-like queries on Hadoop data. The Hadoop framework has entered the galaxy of Apache projects and is distributed according to the principles of free software. The main social networks, like Twitter, Facebook and LinkedIn, but also Web giants like Amazon and PayPal, rely on this framework.
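The "schemaless collection" idea can be sketched in a few lines of plain Python, using dicts in place of BSON documents. This is an illustration only, not MongoDB's API: no server is involved, and the `find()` helper and sample documents are ours (the dot-notation path mimics how MongoDB addresses nested fields):

```python
# Two documents in the same collection, with different fields:
# schemaless storage imposes no common structure.
collection = [
    {"_id": 1, "Purchase": "Blue Dress",
     "Address": {"City": "New York", "Zip Code": "12345"}},
    {"_id": 2, "Purchase": "Black Pumps", "Loyalty Card": True},
]

def find(collection, path, value):
    """Return the documents whose (possibly nested) field equals value.
    `path` uses MongoDB-style dot notation, e.g. "Address.City"."""
    results = []
    for doc in collection:
        node = doc
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        if node == value:
            results.append(doc)
    return results

matches = find(collection, "Address.City", "New York")
print([doc["_id"] for doc in matches])  # [1]
```

Documents lacking the queried field (like the second one here) are simply skipped, which is exactly what makes heterogeneous collections queryable.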

However, Hadoop is not enough
As Hadoop and its multiple components suggest, Big Data is not confined to a single technology or technique. It is a trend that drastically transforms companies and their relationship to information. It is therefore not the sole concern of the IT department, which must establish the technical infrastructure; it is a transverse upheaval within the company, which must be carefully prepared by involving all services and staff, both in the way they consider information and in the types of queries that can be run in order to "talk" to the raw data. In this sense, Big Data is in no way miraculous and does not automatically produce indicators to assist decision-making. One must be able to question the ever-increasing volume of data in light of one's own intuitions and the objectives one seeks to achieve. Two types of attitudes now dominate Big Data: operational research on data, to grasp its immediate meaning in real time, and analytical research, where we retrospectively view the information as a whole through much more complex queries. The two trends complement each other and are to some extent contradictory: operational systems, such as NoSQL DBMS, are capable of handling concurrent requests and strive to reduce response time for very specific searches, while analytical systems, facing very large volumes of data, use treatments likely to take longer. The nature of the information queried by companies varies widely between structured and unstructured forms. Therefore, the tendency is to combine technologies and tools so as to query a wide variety of content in parallel. According to a recent IDC study, nearly 32% of companies have already carried out a Hadoop deployment and 31% intend to do so in the next twelve months. But Ken Rudin, head of analytics at Facebook, recently said that "for companies seeking to exploit large amounts of data, Hadoop is not enough".
According to the same IDC study, nearly 36% of companies have deployed Hadoop and NoSQL DBMS to complement another type of database, particularly MPP DBMS (Massively Parallel Processing, SQL-compliant) such as HP Vertica or Greenplum. To be most effective, traditional structured data sets are carefully correlated with unstructured information from new sources. The integration and deployment offering is now very broad, and all the traditional players such as HP, IBM, Microsoft and Oracle propose solutions built around specific Hadoop distributions, separate DBMS and proprietary data visualization tools. HP, for example, offers its own Vertica DBMS, which integrates bi-directionally with all major Hadoop distributions, including Cloudera, Hortonworks and MapR. Microsoft highlights HDInsight, based on the Hortonworks distribution, integrating it with Windows Server and Windows Azure. The Redmond giant develops in parallel its Microsoft SQL Server DBMS, already widely used in traditional Business Intelligence. Oracle has developed an in-house solution combining a Cloudera Hadoop distribution with its own NoSQL DBMS. SQL overlays, but also alternative file systems such as Red Hat's GlusterFS or IBM's General Parallel File System (GPFS), are sometimes preferred to Hadoop's HDFS for deploying hybrid solutions.
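The operational/analytical distinction drawn earlier can be made concrete with a tiny sketch. The event store, data and function names below are invented for illustration; the point is the shape of the two access patterns, not any particular product's API:

```python
# A hypothetical stream of sales events.
events = [
    {"user": "alice", "amount": 120.0},
    {"user": "bob",   "amount": 40.0},
    {"user": "alice", "amount": 75.0},
]

def last_purchase(user):
    # Operational query: a narrow, low-latency lookup keyed on one value,
    # the kind of access NoSQL stores optimize for.
    amounts = [e["amount"] for e in events if e["user"] == user]
    return amounts[-1] if amounts else None

def total_revenue():
    # Analytical query: a full scan aggregating the whole dataset,
    # typically run retrospectively and allowed to take longer.
    return sum(e["amount"] for e in events)

print(last_purchase("alice"))  # 75.0
print(total_revenue())         # 235.0
```

At scale, the first pattern is served by an operational store under concurrent load, while the second is handed to an analytical engine (Hadoop, an MPP DBMS) that trades latency for throughput.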

Choosing a Big Data solution
Although still young, Big Data technologies are evolving at high speed, and many actors compete on the market. It must be said that the future looks particularly bright! According to a study by the firm Transparency Market Research, worldwide turnover was expected to reach $8.9 billion by the end of 2015, with annual growth of the order of 40% over the following two years, reaching $24.6 billion by 2016. IDC completes this view, stating that the Big Data services segment alone should grow by 21.1% per year. The most optimistic studies, particularly those conducted by ABI Research, forecast a turnover of around $114 billion by the 2018-2020 period. Companies can therefore turn to a huge variety of integrated solutions to deploy their Big Data project. On top of a Hadoop distribution, these usually include a series of "packages" designed to automate and accelerate treatments, and propose a set of APIs to develop internal applications in a familiar development environment, to plan the execution of queries and to ensure better data visualization. To choose wisely, companies must think about the nature of the data they already collect or intend to store, but above all adjust their expectations regarding its interpretation. Data visualization tools, for example, often differ from one provider to another and can greatly assist an audience of non-statisticians. As such, market participants agree on adding three new "Vs" to the established rule: visualization, veracity and value. The first relates specifically to data visualization tools; handling colossal volumes of information at high speed is not enough: the decision unit must be able to interpret the results just as quickly. Veracity is a newer trend, which introduces algorithms to verify the relevance and quality of information.
On Twitter in particular, it has become essential to separate useful data from noise, and so to distinguish genuine user reviews from messages posted by robots. Finally, value relates to the ultimate fulfillment of Big Data: generating a genuinely interesting return on investment rather than stopping at mere technical performance. It is precisely at this stage that future Big Data experts will act. Data scientists, design engineers, statisticians, NoSQL and Hadoop development experts… the market is growing fast, and Gartner expects Big Data to create 4.4 million jobs worldwide by the end of 2015. Young technologies taking sustainable root in academic training, with real government support for their development in France: more than a buzzword, Big Data could eventually become one of the main vectors of business growth.

© HPC Today 2021 - All rights reserved.
