Getting your corporate data ready for prescriptive analytics: data quantity and quality in equal measure
Good news: there’s nothing special about getting your data ready for prescriptive analytics. Bad news: you need to do what’s needed to get your data ready for any type of analytics – and that’s hard work.
Prescriptive analytics is nothing short of automating your business. This was the silver lining as we explored the complexities of prescriptive analytics in our guide. While much of that complexity is something line-of-business staff and expert data scientists will have to deal with, IT is not out of the equation either.
Prescriptive analytics is hard, and there’s no silver bullet that can get you there without having gone through the evolutionary chain of analytics. You have to get the data collection and storage infrastructure right, the data modeling right, and the state classification and prediction right.
This is the prescriptive analytics bottom line, and IT has to make sure the data collection and storage infrastructure parts are in place for business and data science to do their parts. The data cleaning and organization necessary for success with prescriptive analytics can be thought of along two dimensions: quantity and quality of the data that will be used to feed the analytics.
To begin with, IT needs to make sure all the data pertinent to the organization are accounted for and accessible. This really is a sine qua non of any analytics effort, but it may be more complicated than it sounds.
Consider all the applications an organization may be using: custom built, off the shelf, on premises, in the cloud, legacy. Each of those may have its own format, storage, and API. IT needs to make sure they are all accessible, without disrupting the operation of applications. A data lake approach may be useful in that respect.
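As a rough illustration of what "accounted for and accessible" could mean in practice, here is a minimal Python sketch of a source inventory. The `DataSource` fields, the `sources_without_export` helper, and the example entries are all hypothetical, invented for illustration; a real catalog would track far more detail.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataSource:
    """One application or store that holds organizational data (hypothetical model)."""
    name: str
    hosting: str                      # e.g. "on-premises", "cloud", "legacy"
    data_format: str                  # e.g. "sql", "json", "fixed-width"
    export_api: Optional[str] = None  # None means no known way to export data


def sources_without_export(sources: List[DataSource]) -> List[str]:
    """Return the names of sources that have no known export path."""
    return [s.name for s in sources if s.export_api is None]


# Illustrative entries only; these systems are made up.
inventory = [
    DataSource("crm", "cloud", "json", export_api="rest"),
    DataSource("billing", "on-premises", "sql", export_api="jdbc"),
    DataSource("warehouse-legacy", "legacy", "fixed-width"),
]

print(sources_without_export(inventory))
```

Even a list this simple makes the gaps visible: any source without an export path is one where IT has to find a workaround before its data can feed analytics.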
And it gets worse. Data may also live beyond applications. Consider all the internal documents and emails, for example. More often than not, a wealth of data lives in unstructured formats and undocumented sources. And many applications are themselves undocumented, inaccessible, and lack APIs to export data. For those, you will have to either get resourceful, or fail fast.
Even where you succeed, however, this is not a one-off exercise. Applications evolve, and with them so do their data. APIs change, schemas change, new data is added. New applications get thrown into the mix, and old ones become deprecated. Staying on top of data collection requires constant effort, and this is a cost you need to factor in when embarking on your prescriptive analytics journey. Adding semantics to your data lake may help.
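One way to picture the ongoing effort is a routine check for schema drift. The sketch below is a minimal Python example under simple assumptions: a schema is represented as a plain mapping from field name to type name, and `diff_schema` and the sample schemas are hypothetical names made up for illustration, not part of any particular tool.

```python
from typing import Dict


def diff_schema(old: Dict[str, str], new: Dict[str, str]) -> dict:
    """Compare two {field: type} mappings and report what changed."""
    added = {f: t for f, t in new.items() if f not in old}
    removed = {f: t for f, t in old.items() if f not in new}
    # Fields present in both versions whose declared type differs.
    changed = {
        f: (old[f], new[f])
        for f in old
        if f in new and old[f] != new[f]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Illustrative schemas for a made-up "orders" export.
orders_v1 = {"id": "int", "amount": "float", "region": "str"}
orders_v2 = {"id": "str", "amount": "float", "currency": "str"}

drift = diff_schema(orders_v1, orders_v2)
print(drift)
```

Running a comparison like this whenever a source is ingested gives early warning that an upstream application has changed, before the change silently corrupts downstream analytics.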
Speaking of cost: of course, the usual IT provisioning discourse applies here, too. Do you plan ahead, make this a project with a predetermined budget for infrastructure and personnel costs, and get it through the organizational budget approval process? Or do you take a more agile, pay-as-you-go approach?
The former is theoretically safer, and more in line with organizational processes. Here’s the problem: Unless your data sources are relatively limited and well understood, and you are very thorough in keeping track of and provisioning for them, this approach may be impossible in practice.
The latter is more flexible, but can also lead to budget overruns and shadow IT issues. Without some sort of method to the madness, you may end up spending beyond control, and having your data stored all over the place. Although this is not a hard-and-fast rule, the budgeting-ahead approach makes more sense when going for on-premises storage, while cloud storage and development lend themselves well to the pay-as-you-go approach.
Finally, data freshness is one more consideration. If you want your analytics to reflect the real world in real time, the data that feeds it should arrive in real time, too. This means you should consider streaming data infrastructure. While there are benefits in adopting streaming, it’s a new paradigm that comes with its own learning curve and investment in software, hardware, and people.
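Before committing to streaming, it can help to measure how stale each source actually is. The following Python sketch assumes you can obtain a last-updated timestamp per source; the `stale_sources` helper, the source names, and the 15-minute threshold are hypothetical choices made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def stale_sources(
    last_updated: Dict[str, datetime],
    max_lag: timedelta,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return, in sorted order, the sources whose data is older than max_lag."""
    if now is None:
        now = datetime.now(timezone.utc)
    return sorted(
        name for name, ts in last_updated.items() if now - ts > max_lag
    )


# Illustrative timestamps for made-up sources, using a fixed reference time.
reference = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "clickstream": reference - timedelta(seconds=30),
    "crm": reference - timedelta(hours=6),
    "billing": reference - timedelta(days=1),
}

print(stale_sources(last_seen, timedelta(minutes=15), now=reference))
```

Sources that routinely exceed the lag your analytics can tolerate are the natural first candidates for a streaming pipeline; the rest may be fine on batch loads.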