Building Supply Chains for Big Data

For a business making shoes, having a reliable supply of quality leather and other materials is vital. For an organization in the business of selling data products, a reliable supply of raw data is equally vital. The hidden cost of any reporting or business intelligence implementation has always been data preparation, so as organizations start to commercialize their data, creating an effective and efficient data supply chain will increase profitability and reduce lead times when provisioning data products.

At Mo-Data, we build Data Supply Chains that create a reliable process of data acquisition, provide a broad and deep supply of data for data scientists, and then establish the infrastructure required to provision it in a compelling and efficient manner to end customers.

Stage I - Sourcing

The data supply chain begins with capturing all possible sources of data - from internal transactional and reporting systems, from the public domain, from social data, and from partners and customers.

The 'data lake' is a conceptual structure that simply recognizes all of the different sources, in their entirety, stored in whatever silo and whatever storage structure they happen to use, and kept as unadulterated as possible.
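As a simple illustration of that principle (a sketch only, not Mo-Data's tooling, with a hypothetical landing path and metadata format), raw files can be landed in the lake untouched, with just enough metadata to record where they came from:

```python
import json
import shutil
import time
from pathlib import Path

LAKE_ROOT = Path("/data/lake/raw")  # hypothetical landing area

def land_raw_file(source_name: str, path: str) -> Path:
    """Copy a source file into the lake unchanged, recording where it came from.
    The data itself is preserved as-is rather than transformed on arrival."""
    dest_dir = LAKE_ROOT / source_name / time.strftime("%Y-%m-%d")
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(path).name
    shutil.copy2(path, dest)

    # Sidecar metadata file: provenance only, no reshaping of the data.
    meta = {"source": source_name, "original_path": path, "landed_at": time.time()}
    (dest.parent / (dest.name + ".meta.json")).write_text(json.dumps(meta))
    return dest
```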

In the old world of data warehousing and business intelligence, we built systems that would efficiently answer pre-determined questions. For example, a mobile phone network might ask how much to bill each customer every month, with a breakdown of charges by day or by call - so call duration, time and distance would be retained, but the signal strength during the call would be discarded as unnecessary.

In the world of big data and data science, however, the answers are easy to determine; figuring out the right questions is where competitive advantage lives. The same mobile phone network might discover that subscribers with weak signal strength during critical periods of the day are more likely to switch networks within three weeks. Answering such a question requires that the signal strength data was retained.
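To make the contrast concrete, a minimal pandas sketch of testing that hypothesis might look like the following. The file names, column names, 'critical period' hours and the -100 dBm threshold are all assumptions for illustration, not a prescribed schema:

```python
import pandas as pd

# Hypothetical extracts from the retained raw data.
signal = pd.read_csv("signal_readings.csv")    # subscriber_id, timestamp, signal_dbm
churn = pd.read_csv("subscriber_churn.csv")    # subscriber_id, churned (0/1)

# Keep only readings from an assumed critical period (evenings, 6pm-11pm).
signal["timestamp"] = pd.to_datetime(signal["timestamp"])
evenings = signal[signal["timestamp"].dt.hour.between(18, 22)]

# Average evening signal per subscriber, then compare churn rates
# between weak-signal and normal-signal groups.
avg_signal = evenings.groupby("subscriber_id")["signal_dbm"].mean()
features = churn.set_index("subscriber_id").join(avg_signal)
features["weak_signal"] = features["signal_dbm"] < -100  # illustrative threshold

print(features.groupby("weak_signal")["churned"].mean())
```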

Stage II - Data Science

The data scientist is the inventor of insights - or at the very least the prover of insights. As such, they need access to all possible data sources and the ability to request and acquire new ones. The problem for data scientists is how NOT to spend valuable research cycles on cleaning and preparing data prior to analysis - normalizing, dealing with data anomalies (where the data is clearly incorrect), bringing in streaming data and other operations that can be done by cheaper resources.
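The kind of work being moved off the data scientist's plate can be pictured as a small preparation routine run upstream of analysis. This is a sketch only; the column names, units and thresholds are assumptions made for the example:

```python
import pandas as pd

def prepare_signal_readings(raw: pd.DataFrame) -> pd.DataFrame:
    """Upstream preparation so the data scientist receives analysis-ready data."""
    df = raw.copy()

    # Normalize units: assume some feeds report ASU rather than dBm.
    if "signal_asu" in df.columns and "signal_dbm" not in df.columns:
        df["signal_dbm"] = 2 * df["signal_asu"] - 113  # standard GSM ASU-to-dBm conversion

    # Anomalies: drop readings outside a physically plausible range.
    df = df[df["signal_dbm"].between(-120, -30)]

    # Timestamps: parse, drop unparseable rows, and sort so downstream
    # analysis can assume ordered data.
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    return df.dropna(subset=["timestamp"]).sort_values("timestamp")
```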

The data supply chain approach advocates putting a number of standard tools, processes and procedures in place - sometimes these are low-tech guidelines for those supplying data to follow, other times they may be NLP processes that add structure to unstructured data streams. The idea is to minimize the cleaning work required of the data scientist, allowing them to focus on generating the right insights.
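As one illustration of the latter kind of process, a minimal sketch using the spaCy library to add structure to free-text customer complaints might look like this (the field names and the 'dropped call' keyword rule are assumptions for the example):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small general-purpose English model

def structure_complaint(text: str) -> dict:
    """Turn a free-text complaint into a partially structured record."""
    doc = nlp(text)
    return {
        "raw_text": text,
        "entities": [(ent.text, ent.label_) for ent in doc.ents],  # places, dates, orgs...
        "mentions_dropped_call": "dropped call" in text.lower(),
    }

print(structure_complaint("Dropped calls every evening in Cambridge since March 3rd."))
```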

Once the insight has been created, whether that is a trend, an anomaly, an aggregation or some sort of visualization, the value of that insight can be proven in a test environment.

Stage III - Provisioning

At the point where there is an insight of value, the data product manager will be able to pick this up (of course, the original request or suggestion for a data product may have come from the product manager). Their role is to determine the optimal product-market fit for the insight and to work out how to provision that insight to the end customer (internal or external).

The provisioning may be how that insight is consumed by other processes - as an example, returning to the mobile network insight regarding signal strength, customers predicted to switch due to ongoing network quality issues at home could be proactively sent signal boosters. Other productizations could be dashboards or new data-related services.
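A provisioning hook of that kind could be as simple as the sketch below, which turns the churn-risk insight into an action list that a CRM or fulfilment process could consume. The field names and thresholds are illustrative assumptions:

```python
def flag_for_signal_booster(subscribers, churn_risk_threshold=0.7, weak_signal_dbm=-100):
    """Turn scored subscribers into concrete provisioning actions."""
    actions = []
    for row in subscribers:
        # 'churn_risk' and 'home_signal_dbm' are assumed outputs of the
        # data-science stage, not a defined schema.
        if row["churn_risk"] >= churn_risk_threshold and row["home_signal_dbm"] < weak_signal_dbm:
            actions.append({"subscriber_id": row["subscriber_id"],
                            "action": "send_signal_booster"})
    return actions

print(flag_for_signal_booster([
    {"subscriber_id": 42, "churn_risk": 0.85, "home_signal_dbm": -108},
    {"subscriber_id": 43, "churn_risk": 0.20, "home_signal_dbm": -75},
]))
```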

The data supply chain ensures that new data products can be provisioned with reliability, availability and accuracy, and that this can be done cost effectively.