What does a Data Scientist need to know about Data Governance?

One term that keeps surprising me on data projects is ‘governance’, along with its relatives ‘data quality’ and ‘master data management’ (MDM). It surprises me because I’m not an expert in this discipline, and it’s quite different from my Machine Learning work.

The aim of this blog post is to just jot down some ideas on ‘data governance’ and what that means for practitioners like myself.

I chatted to a friend, Friso, who gave a talk on Dirty Data at Berlin Buzzwords. In his talk he mentions ‘data governance’, so I reached out to him to clarify what he meant by it.

I came to the following conclusions, which I think are worth sharing. They are similar to some of the principles that Enda Ridge talks about when he speaks of ‘Guerrilla Analytics’.

  • Insight 1: Lots of MDM, Data Governance, etc. solutions amount to ‘buy our product’. None of these tools replaces good process and good people. Technology is only ever an enabler.
  • Insight 2: Good process and good people are two hard things to get right.
  • Insight 3: Often companies care about ‘fit for purpose’ data. Checking that is much the same as quality control for any other process, so insights from statistical quality control or anomaly detection can be useful here.
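To make Insight 3 a little more concrete, here is a minimal sketch of a control-chart style check, assuming the thing being monitored is a daily row count for an incoming feed. The function names, the sample numbers and the three-sigma threshold are all illustrative assumptions rather than a recommended standard.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, upper) limits as mean +/- k sample standard deviations."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - k * spread, centre + k * spread


def out_of_control(values, limits):
    """Return (index, value) pairs that fall outside the control limits."""
    lower, upper = limits
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical daily row counts from a period when the feed was known to be healthy.
baseline_counts = [1000, 1020, 980, 1010, 990, 1005, 995]
limits = control_limits(baseline_counts)

# Hypothetical new deliveries; the second one is suspiciously small.
new_counts = [1002, 640, 1015]
flagged = out_of_control(new_counts, limits)
print(flagged)  # [(1, 640)]
```

The same pattern works for any summary you can compute per delivery (null rate, number of distinct keys, a column mean), which is often enough to catch data that is no longer fit for purpose before it reaches a model.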

In practical terms, make sure you have a map (or workflow capturing your data provenance) and some sort of documentation (metadata or whatever is necessary) that explains how to get from the ‘raw data’ given to you by a stakeholder to the data you output.
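One lightweight way to keep such a map is to write a small provenance record every time a raw file is turned into an output file. The sketch below is only one possible shape for that record: the field names, the helper functions and the temporary demo files are all hypothetical, and a real project might want more (or less) metadata.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Hash a file's bytes so the exact version that was used can be identified later."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def provenance_record(raw_path, output_path, step, notes=""):
    """Describe one transformation from a raw file to an output file."""
    return {
        "step": step,
        "raw_file": str(raw_path),
        "raw_sha256": sha256_of(raw_path),
        "output_file": str(output_path),
        "output_sha256": sha256_of(output_path),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "notes": notes,
    }


# Demo with throwaway files standing in for a stakeholder's raw data and the cleaned output.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.csv"
    out = Path(tmp) / "clean.csv"
    raw.write_text("id,amount\n1, 10\n2, 20\n")
    out.write_text("id,amount\n1,10\n2,20\n")
    record = provenance_record(raw, out, step="strip whitespace")
    print(json.dumps(record, indent=2))
```

Appending records like this to a file that lives next to the analysis code gives you a readable trail from raw data to output without buying anything.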

I think adding a huge operational overhead of complicated products, vendors, meetings, etc. is a distraction, and can lead to a lot of pain.

Adopting some of the ‘infrastructure as code’ ideas is really useful, since code and reproducibility are really important in understanding whether data is ‘fit for purpose’.
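Borrowing that idea for data work could look something like the sketch below, where every cleaning step lives in code and the whole pipeline can be rerun and compared. The step functions, the sample rows and the fingerprinting helper are made up for illustration, so treat this as a rough outline rather than a prescribed design.

```python
import hashlib
import json


def drop_missing(rows):
    """Remove rows that have no amount."""
    return [r for r in rows if r.get("amount") is not None]


def to_cents(rows):
    """Convert amounts to integer cents."""
    return [{**r, "amount": round(r["amount"] * 100)} for r in rows]


# The pipeline is an ordered list of steps kept in version control,
# instead of a sequence of manual edits that nobody can replay.
PIPELINE = [drop_missing, to_cents]


def run(rows, steps=PIPELINE):
    """Apply each step in order and return the transformed rows."""
    for step in steps:
        rows = step(rows)
    return rows


def fingerprint(rows):
    """Hash the output so two runs can be compared cheaply."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical raw rows as they might arrive from a stakeholder.
raw_rows = [
    {"id": 1, "amount": 1.5},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 2.25},
]

first = run(raw_rows)
second = run(raw_rows)
reproducible = fingerprint(first) == fingerprint(second)
print(first, reproducible)
```

Because the steps are plain functions, a change to the cleaning logic shows up as a diff in code review, and a changed fingerprint tells you the output moved.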

Another good summary comes from Adam Drake on ‘Data Governance’.

If anyone has other views or critiques I’d love to hear about them.
