5 steps to a modern data architecture

Becoming a truly data-driven organization requires adopting a more centralized approach to data architecture and analysis

Modern data systems still mainly process data in batches. The next stage is to move to “real time” technologies and make the entire company operate on “events” rather than on the year, the quarter, and the month.

We have all become accustomed to batch processing by calendar period. However, the world doesn’t work this way. Recently, the market in China underwent a rapid transformation, and any company whose plans for the year or quarter focused on that region had to react quickly. Customers change their minds, world events occur, and nothing happens on your schedule. Yet everything from money-laundering detection to fraud detection to promotions is generally handled in batch.

However, there is another way of working: As events happen, you tally them up, and once they cross a threshold, you make a decision. With modern data systems we don’t actually have to batch data to make decisions; instead, we can tally events and act on thresholds.
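
To make the contrast concrete, here is a minimal sketch of the tally-and-threshold idea in Python. The event stream, the suspicion scores, and the review function are all hypothetical; a production system would run on a streaming platform such as Spark Streaming or Kafka rather than an in-memory loop.

```python
from collections import defaultdict

# Hypothetical threshold: flag an account once its running total
# of suspicious-activity points crosses this value.
SUSPICION_THRESHOLD = 100

running_totals = defaultdict(int)

def flag_for_review(account_id, score):
    # Stand-in for a real action: open a case, hold a transaction, etc.
    print(f"Review account {account_id}: suspicion score {score}")

def on_event(account_id, suspicion_points):
    """Tally each event as it arrives and decide immediately,
    instead of waiting for an end-of-day batch job."""
    running_totals[account_id] += suspicion_points
    if running_totals[account_id] >= SUSPICION_THRESHOLD:
        flag_for_review(account_id, running_totals[account_id])
        running_totals[account_id] = 0  # reset after acting

# Simulated event stream
for account, points in [("A-17", 40), ("A-17", 35), ("A-17", 30), ("B-02", 10)]:
    on_event(account, points)
```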

Newer businesses have been moving to “real time” in order to disrupt older ones. Dell famously kept little inventory and made computers to order. Amazon has disrupted the entire retail industry with a distribution system that takes orders and puts inventory at the customer’s door in two days or less. In both retail and manufacturing, “just in time” is a long-sought goal.

The world of financial services is moving to real time faster than almost any other industry. In this sector you often have no tangible asset that requires complex logistics, which simplifies the move to an event-based system. In other industries, the shift requires renegotiating your relationships with vendors and suppliers to provide goods and services “on demand,” and being able to adjust your orders or your labor up or down as events require. Moving to real time also mandates rethinking older IT infrastructure and reducing its complexity.

By adopting a real-time model, you eliminate the need for reporting over large data sets for day-to-day operations. Processing large data sets will still be necessary for historical analysis, but most data will be immediately up to date. Naturally, this will allow the organization to react more quickly to market conditions and evolve as change occurs. Ultimately, this is the competitive advantage.

Step 1: Consolidation

It is difficult to be data-driven if you don’t have a holistic view of your data. Moreover, the agents of this change are not the systems but the people who make the company work. They need to be able to efficiently and effectively use the data. The only way to do that is to bring data together.

This requires creating an inventory of data assets, a central repository (aka a data lake or enterprise data hub), and appropriate views of the data, then mapping data assets and views to business roles for security purposes. Ideally, data governance tools and processes are put into place at the same time.

From a technology perspective, you can use tools like Sqoop and Kettle to feed the data into a Hadoop-based repository queried through Impala or Hive. These feeds can be scheduled with Oozie or a similar tool. The consolidation process is best conceived in combination with the analytics process, because without clear use cases consolidation will go out of scope and fail.
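
As an illustration only (the tools named above are Sqoop, Kettle, and Oozie; this sketch uses PySpark’s JDBC reader instead), a single consolidation feed might look roughly like the following. The connection string and table names are hypothetical.

```python
from pyspark.sql import SparkSession

# Hypothetical source database and target Hive table.
JDBC_URL = "jdbc:postgresql://crm-db.example.com:5432/crm"
SOURCE_TABLE = "customers"
TARGET_TABLE = "datalake.crm_customers"

spark = (
    SparkSession.builder
    .appName("crm-consolidation-feed")
    .enableHiveSupport()
    .getOrCreate()
)

# Pull the operational table into the cluster...
customers = (
    spark.read.format("jdbc")
    .option("url", JDBC_URL)
    .option("dbtable", SOURCE_TABLE)
    .option("user", "etl_user")
    .option("password", "...")  # in practice, pulled from a secrets store
    .load()
)

# ...and land it in the Hadoop-based repository, where Hive or Impala can query it.
customers.write.mode("overwrite").saveAsTable(TARGET_TABLE)
```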

Step 2: Analytics

The goal is to make data “self service,” so when someone has an idea, they can go directly to the repository for the data, rather than having to ping IT or the department sourcing the data. Getting there, however, is no small matter: It requires structuring the data.

In general, the process is as follows: Views are created for cross-department or cross-data source reports and analytics. Initial sets of dashboards and reports are created in newer, more attractive formats. An analytics tool is purchased and deployed throughout the organization, and the staff is trained on its use and function.
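
For instance, a cross-department view joining sales and support data can be defined once in the repository so that report builders never touch the raw feeds. This is a minimal sketch using Spark SQL over Hive tables; the table and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical tables landed during consolidation (Step 1).
# The view gives analysts one self-service starting point instead of
# two departmental extracts.
spark.sql("""
    CREATE OR REPLACE VIEW datalake.customer_health AS
    SELECT s.customer_id,
           s.total_revenue,
           c.open_tickets,
           c.last_contact_date
    FROM   datalake.sales_summary  s
    JOIN   datalake.support_cases  c
      ON   s.customer_id = c.customer_id
""")
```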

Both the users and the providers of the data are interviewed in order to get a complete picture of the needs, abilities, and processes around report creation, which will be done using tools like Tableau. This work is best done concurrently with the data consolidation, as both processes require use cases.

Step 3: Process mapping and automation

Automating a process requires understanding the process. This involves reviewing the company’s source information and interviewing the executives and decision-makers. However, it also requires interviewing the people who actually run the process on a day-to-day basis. The informal process is perhaps even more important than the formal one.

The output of this activity is a process map, which usually takes the form of a diagram. With many business-process management systems, the tool that generates the diagram also creates a runnable configuration file for a business-process automation engine. After the initial diagrams are created, a data systems approach is used to plug computational processes into them. This is both human analysis and systems analysis, and it requires some transformation of both. Simply implementing a tool such as jBPM isn’t sufficient; real change is required in how the organization operates.
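
To show what “runnable” means here without reaching for a full BPM suite, the toy sketch below encodes a mapped order-approval process as data that a tiny engine can walk. The step names and handoffs are hypothetical; in practice, a tool such as jBPM generates and executes an equivalent definition from the diagram itself.

```python
# A mapped process expressed as data: each step names the role that
# performs it and the step that follows. A BPM tool derives the same
# structure from the process diagram.
ORDER_PROCESS = {
    "submit_order": {"role": "sales",     "next": "credit_check"},
    "credit_check": {"role": "finance",   "next": "fulfillment"},
    "fulfillment":  {"role": "warehouse", "next": "invoice"},
    "invoice":      {"role": "finance",   "next": None},
}

def run_process(process, start_step):
    """Walk the process definition step by step."""
    step = start_step
    while step is not None:
        spec = process[step]
        print(f"{spec['role']} performs '{step}'")
        step = spec["next"]

run_process(ORDER_PROCESS, "submit_order")
```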

Step 4: Decision mapping and automation

Process and decision mapping are best done in parallel. The idea is to map decisions that are algorithmic or numbers-based. Some of these take place informally: Someone may look at a bar chart, see that two bars are about the same length, remember that business tends to pick up after the summer, and decide to place an order. This is an informal system that can be replaced by formal rules, and such rules can have adjustable thresholds and parameters.

There are multiple ways to accomplish this, from writing rules in a “rules language” such as Red Hat’s JBoss Rules or IBM’s WebSphere ILOG JRules, to creating a domain-specific language and expressing the rules in it, to using decision tables. In some places, actual algorithms might be implemented in R or Python, possibly with Spark for in-memory execution at scale.
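
As a small illustration of what a formalized, parameterized rule might look like (the reorder point, seasonal factor, and sales figures below are all hypothetical), the bar-chart-and-gut-feel decision above could be expressed in Python roughly as:

```python
# Adjustable parameters that governance (Step 5) can tune without
# rewriting the rule itself.
REORDER_POINT = 500           # minimum acceptable buffer of units on hand
POST_SUMMER_FACTOR = 1.3      # demand uplift expected after summer
POST_SUMMER_MONTHS = {9, 10}  # September and October

def should_reorder(units_on_hand, avg_monthly_sales, month):
    """Formal version of 'the bars look even and fall is coming, so order.'"""
    expected_demand = avg_monthly_sales
    if month in POST_SUMMER_MONTHS:
        expected_demand *= POST_SUMMER_FACTOR
    return units_on_hand - expected_demand < REORDER_POINT

# Example: 900 units on hand, selling ~400/month, evaluating in September.
print(should_reorder(900, 400, month=9))  # True: expected demand eats the buffer
```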

This activity works best if started shortly after the initiation of the business process mapping; the two are interrelated, and the output of decisions frequently affects process.

Step 5: Governance

Everything can change: the data, the processes and parameters, the decisions, even the rules or algorithms for making decisions. A system needs to be in place to govern the data, establish its source and validity, and manage its structure. Having a centralized data lake isn’t helpful if you look at a field in a table and can’t answer the questions “What is this? Where did it come from? What does it mean?”
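
Purpose-built catalogs handle this, but the underlying record is simple. Here is a minimal sketch of the kind of metadata a governance system keeps per field; the table, field, and steward shown are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FieldMetadata:
    """Answers 'What is this? Where did it come from? What does it mean?'"""
    table: str
    field: str
    source_system: str
    description: str
    steward: str        # person accountable for the definition
    last_reviewed: str  # governance review date

open_tickets = FieldMetadata(
    table="datalake.customer_health",
    field="open_tickets",
    source_system="Support ticketing system, daily export",
    description="Count of unresolved support cases for the customer",
    steward="analytics@example.com",
    last_reviewed="2016-03-01",
)
```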

Processes need explanations and stories as a way to suggest, implement, and review changes. They also need to be periodically reviewed to ensure they are not stale or out of touch. Decisions and their parameters (along with any associated algorithms) need a similar review and change control process. In the meantime, executives may change strategies or add new lines of business.

Success depends on having an efficient way to adapt these ever-evolving systems. For data, there are tools like Hadoop Revealed or Collibra. Tools for managing process and rule changes are currently subpar, but common software revision control systems like Git or SVN can help. The most important piece is getting that data governance system in place.

If you’re at the handmade-spreadsheet stage, don’t try to become a real-time, data-driven company in one project. Cultural change is as fundamental as the technology, and both take time to evolve and be adopted. Situations and events that the core automated process cannot handle should not be neglected: Address the ones that require human attention, but consider how the process might be adapted to cover them. You may find that certain events are not one-offs, but that multiple instances are percolating throughout the company without the information being shared.

For more on the evolution from traditional data models to the future of data processing with Apache Spark and natural language processing, see the Mammoth Data whitepaper, “Become a Data-Driven Company in 2016.”

Andrew C. Oliver is a professional cat herder who moonlights as a software consultant. He is president and founder of Mammoth Data (formerly Open Software Integrators), a big data consulting firm based in Durham, N.C. He also writes InfoWorld’s Strategic Developer blog.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.
