Big Data is a problem statement. Not a technology. Hadoop and other technologies are helping businesses to solve this problem.
Every minute, there are 1.7 million new likes on Instagram, 300 hours of new video uploaded to YouTube, and 4 million new likes on Facebook.
And the world's data doubles every two years.
In fact, customers produce volumes of data when they interact and express their preferences and frustrations. And so leading companies extract insights from this data to deliver what customers want, and beat the competition.
These large volumes of data of different types, generated so fast and in multiple formats, are what we refer to as "Big Data".
Being able to make sense of big data to make fast, high-quality decisions is one of the secrets behind the success of giants like Apple, Google, Amazon, and Coca-Cola.
But what keeps other businesses from doing the same?
In fact, a lot of businesses rely on traditional database management tools such as MySQL, Oracle, etc. These systems can only handle data that is, or can be, structured.
But today, over 90% of the data generated in the world is unstructured or semi-structured: emails, videos, images, audio files, documents, social media posts, and so on.
And so traditional database management tools handle only 10% of the data your customers generate.
Consequently, serious businesses are turning to new sets of tools that can store and process unstructured data, so they can leverage the 90% of insights they’re missing about their customers.
So, in the next couple of minutes, I’ll be helping you understand the big data problem, the tools successful companies have used to tackle it, and how your organization can solve this problem as it transforms.
Firstly, Big Data is a problem statement
Simply put, big data is a term for datasets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.
In fact, we express the big data problem as that of:
- Volume: storing and processing huge datasets
- Variety: storing and processing different types of data
- Velocity: data is being generated at an alarming rate
- Value: the need to extract correct meaning from the data, and
- Veracity: uncertainty and inconsistencies in the data
Elsewhere you’ll find the same problem expressed with 3Vs, or even 7Vs. But you’ll always find volume, velocity, and variety.
Data is the new gold mine, now more than ever before. Insights hidden in data help companies know their customers, and design quality products and services exactly the way customers expect.
Clearly, the problem associated with unstructured data has never been its rarity. Rather, it’s been the lack of tools and technologies able to extract business value from this diverse and disordered digital resource.
However, legacy database management systems have shown their limits when it comes to addressing big data problems.
As a result, many insights hidden in the original high-fidelity raw data are never explored. And even structured data becomes a problem once it grows huge, forcing companies to create archives.
At the same time, processing becomes a serious problem, even with structured data that has grown huge, because traditional tools store and process data in separate locations.
So data has to be queried from storage every time, which becomes slow due to the throughput constraints of the network.
Fortunately, new tools like Apache Hadoop are helping to solve these problems.
Secondly, New Tools are helping to solve this Big Data problem
Facebook, Twitter, Google, Yahoo, and most big players in the world use Hadoop and its ecosystem components to store and process huge volumes of data generated on their platforms.
Apache Hadoop is a framework that allows storage and processing of huge data sets across clusters of commodity hardware, in a distributed and parallel manner.
Firstly, to solve the problem of storage, Hadoop splits data into blocks, replicates each block, and stores the copies across multiple commodity machines.
The Hadoop Distributed File System stores all kinds of data types, irrespective of size and how fast the data arrives.
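To make this concrete, here is a minimal sketch of writing a file to HDFS through Hadoop's Java FileSystem API, with an explicit replication factor. The NameNode address, file path, and replication factor of 3 are assumptions for illustration; adjust them to your own cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's address
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, used here only for illustration
        Path file = new Path("/data/customer-feedback.txt");

        // Replication factor 3: HDFS keeps 3 copies of each block on
        // separate machines, which protects against hardware failure
        FSDataOutputStream out = fs.create(
                file, true, 4096, (short) 3, 128L * 1024 * 1024);
        out.writeUTF("sample customer feedback record");
        out.close();
        fs.close();
    }
}
```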
Secondly, to solve the problem of processing speed, the system processes data exactly where it is stored. No need to extract data to a separate location.
And so this is how large volumes of structured and unstructured data get stored and processed at high speed, to solve the big data problem.
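As a sketch of what "processing data where it is stored" looks like in practice, below is the classic MapReduce word-count job in Java. Hadoop ships the map tasks to the nodes that hold the input blocks, so only the small intermediate counts travel over the network; the input and output paths are taken from the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The mapper runs on the node holding each block of input data
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // The reducer sums the partial counts produced across the cluster
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```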
To sum up, with Hadoop and its ecosystem tools, you’re sure of five things.
- Your data is alive forever. No need to archive, because storage is cheap.
- Your data processing speed is high, as storage and processing take place in the same place.
- You can transform data into actionable insights, using powerful data exploration and advanced analytics tools from Hadoop's ecosystem components like Pig, Hive, HBase, Mahout, and so on (see the sketch after this list).
- Low risk of data loss from hardware failure, as multiple copies of each data block are stored on separate machines.
- You can seamlessly add hardware to scale the system up or down in a flexible manner, no matter how large your data becomes.
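For instance, here is a minimal sketch of querying data with Hive over JDBC from Java. The HiveServer2 address, credentials, and the customer_events table are assumptions for illustration; Hive compiles the SQL-like query into jobs that run across the cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveInsightExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (requires hive-jdbc on the classpath)
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed HiveServer2 address, database, and credentials
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = conn.createStatement();

        // Hypothetical table of raw customer interactions
        ResultSet rs = stmt.executeQuery(
                "SELECT event_type, COUNT(*) AS total "
              + "FROM customer_events GROUP BY event_type");
        while (rs.next()) {
            System.out.println(rs.getString("event_type")
                    + ": " + rs.getLong("total"));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}
```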
Interestingly, this solution turns out to be affordable, because the commodity hardware used in Hadoop clusters is relatively cheap. In addition, Hadoop software is open source, meaning you can download it and use it for free.
However, you don't even have to buy that commodity hardware yourself, since Amazon, Google, Microsoft, and IBM have worked so hard that you can rent any amount of hardware through what we call cloud computing.
Thirdly, Big Data can help you build competitive advantage
But data on its own cannot give you a competitive advantage, unless you transform it into actionable insights in a well-defined context.
This transformation process is what we call Big Data Analytics.
To turn data into competitive advantage, your organization needs to have a clear data strategy that aligns with the digital transformation strategy.
And you need just four steps to achieve this.
1. Have a Clear Data Strategy
This is often easier said than done. In fact, companies embark on ambitious programs to develop a new data warehouse without a clear strategy, and consequently end up with predictably disappointing results.
In a Harvard Business Review (HBR) article, DalleMule and Davenport suggested a simple framework that many companies have used to design clear data strategies.
As with every strategy, the objective must be clear. When it comes to data strategy, your objective lies between two extremes, defense and offense, and you've got to strike the right balance.
- Defense is when you aim to minimize downside risk. So your primary focus is to comply with regulations on data privacy, integrity of financial reports, fraud detection, and prevention of theft.
- Offense is when you aim to support business objectives such as increasing revenue, profitability, and customer satisfaction. And so your data strategy focuses on generating customer insights, and integrating data from multiple sources to support managerial decision making.
Although it’s challenging to get the right balance, every company needs both offense and defense to succeed.
2. Translate the data strategy to use cases
As you'd expect, defense and offense strategies have separate use cases. And your position on the defense vs. offense spectrum is often a trade-off between the need to standardize and the need to be flexible.
Often, the defense strategy focuses on the following use cases:
- Optimizing data extraction
- Data Standardization
- Secured data storage, and
- Easy access to data
On the other hand, the offense strategy focuses on:
- Optimizing data analytics,
- Data modeling, and
- Data transformation.
The list is not in any way exhaustive. And as I mentioned, you’ll probably pick use cases from both lists, depending on the trade-off you’re making between defensive and offensive strategies.
3. Design an innovative architecture that supports use cases
A McKinsey report revealed that organizations can reduce their IT costs and investments by 20 to 30 percent if they:
- simplify their data architecture,
- minimize data fragmentation, and
- decommission redundant systems.
You can think of data architecture as having two levels: (1) a single source of truth, and (2) multiple versions of that truth.
In fact, the single source of truth sits at the data level: a repository containing one copy of all crucial data, such as customer, supplier, and product details.
Meanwhile, the multiple versions of the truth level results from business-specific transformations of data into information by various units or functions of the organization.
For instance, each department has its story based on how it interprets data in its context, but all these interpretations come from the same single source of data stored in the organization.
And fortunately, Hadoop and its ecosystem have a modular architecture, whose components fit together to flexibly support any data architecture you may want.
4. Set up robust data governance to ensure data quality
The common belief that problems with data quality usually stem from technology issues is misleading.
It is the responsibility of leadership to put in place the right governance that ensures data quality.
Simply put, data governance is about balancing control and flexibility.
That’s to say your business use cases will require flexibility so you can draw relevant insights from data to pursue profitability and other business goals.
On the other hand, you must implement appropriate controls to ensure data quality and compliance with regulation.
For instance, in the absence of governance, you'll likely run into problems like:
- ambiguous and mutable data definitions,
- vague data rules that are inconsistently applied, and
- missing feedback loops for improving data transformation.
Hadoop gives you a framework to store and process data of any size, but it is your role to control the quality of data loaded to the system. Yes. Garbage in, garbage out.
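As a small illustration of governance in practice, here is a sketch of a pre-ingestion check that rejects records violating agreed data rules before they ever reach the cluster. The record format and the rules themselves are hypothetical; the point is that quality is enforced by explicit, consistently applied rules, not by the storage layer.

```java
import java.util.ArrayList;
import java.util.List;

public class PreIngestionCheck {
    // Hypothetical rule: a record is (id, email, date), all fields usable
    static boolean isValid(String[] fields) {
        return fields.length == 3
                && !fields[0].isEmpty()                        // id present
                && fields[1].contains("@")                     // plausible email
                && fields[2].matches("\\d{4}-\\d{2}-\\d{2}");  // ISO date
    }

    public static void main(String[] args) {
        String[][] incoming = {
                {"c001", "ada@example.com", "2024-01-15"},
                {"", "no-id@example.com", "2024-01-16"},   // rejected: missing id
                {"c003", "not-an-email", "2024-01-17"},    // rejected: bad email
        };
        List<String[]> clean = new ArrayList<>();
        for (String[] record : incoming) {
            if (isValid(record)) {
                clean.add(record);  // only clean records get loaded into Hadoop
            }
        }
        System.out.println(clean.size() + " of " + incoming.length
                + " records accepted for ingestion");
    }
}
```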
And to sum up…
We saw that Big Data is a problem statement. And this has been expressed as a problem of data size, speed, multiple formats, inconsistencies and the need to extract actionable information from data.
Then we saw how the Hadoop ecosystem helps to solve this problem with distributed storage and parallel computing, since traditional data management tools have shown their limits.
And since data is like the meeting point of all digital technologies, we explored the importance of having a data strategy in place, how you can design one, and how Apache Hadoop helps you to implement this strategy.
Feel free to check this complete guide to key digital technologies, and how they enable digital transformation.
At the same time, stay tuned for more articles on similar topics.