How is Data Warehousing Different?

While data warehousing and software development have much in common, applying scrum to data warehousing initiatives and projects raises some common situations that may not arise in software development.

We don’t deliver software

I’ve seen several data people read the agile principles and immediately take issue with the term ‘working software’. The agile principles were created for software development, but of course working software is not the focus of a DW. So we adapt: rather than working software, we deliver working fields, tables, views, dashboards, reports, data marts, etc. Sprint reviews therefore sometimes have to be creative. For example, if the result of a user story was a new field, the review for that story might be an SQL query showing the distribution of values in the field, along with test results showing which scenarios were accounted for.
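
As a rough illustration of that kind of review query, here is a minimal sketch in Python using the standard-library sqlite3 module. The table dim_customer, the field customer_segment and the helper field_distribution are all hypothetical names invented for this example; in practice you would run the equivalent SQL directly against your own DW.

```python
import sqlite3


def field_distribution(conn, table, field):
    """Return (field_value, row_count, pct_of_rows) tuples, most common value first."""
    # Identifiers cannot be bound as query parameters, so this sketch assumes
    # `table` and `field` come from trusted code rather than user input.
    sql = f"""
        SELECT {field} AS field_value,
               COUNT(*) AS row_count,
               ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM {table}), 1) AS pct_of_rows
        FROM {table}
        GROUP BY {field}
        ORDER BY row_count DESC, field_value
    """
    return conn.execute(sql).fetchall()


# Hypothetical demo data: a small dimension table with the newly added field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER, customer_segment TEXT)")
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?)",
    [(1, "Retail"), (2, "Retail"), (3, "Wholesale"), (4, None)],
)

distribution = field_distribution(conn, "dim_customer", "customer_segment")
for field_value, row_count, pct_of_rows in distribution:
    print(field_value, row_count, pct_of_rows)
```

Showing NULLs as their own row is deliberate here: an unexpected share of missing values is often the first thing a business owner wants to discuss in the review.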

User stories in data warehousing are more focused on what many refer to as a ‘slice’ of the DW. Rather than building each layer in its entirety before moving to the next, you take a subset of the DW and build it from beginning to end (or whatever layers are relevant to that story). Slicing initially worries some architects who are uncomfortable with the risk of future rework, but it’s a risk well worth taking.

A common criticism I hear is that you can’t deliver business value with a DW user story that fits into a two-week sprint. This is sometimes true: sometimes you have to put in some foundation work before true business value can be realised. But the team gets better at this over time, including at breaking epics into user stories that provide some value along the way.

More pre-analysis and prep is necessary

In both software development and data warehousing, the development team’s job is to figure out how to create the code for the user story. They collaborate with business owners to clarify the details as they go. In my experience, there are more unknowns in this process for data warehousing than for software development, and more analysis and prep work is needed before a user story can be considered ‘sprint ready’.

In their article on Agile Business Intelligence, Heizenberg et al. grouped BI user stories into the five categories listed below (the article also provides examples of user stories within each category):

  • Data Disclosure stories are about extracting data from the source system and making it available in a self-service BI environment.
  • Data Augmentation stories are about creating new information based on existing information.
  • Data Presentation stories describe the presentation of the information in a format the user can easily understand.
  • Data Validation stories are about applying business rules to check whether the extracted data is of good enough quality for the end users to work with, and taking the necessary actions when it is not.
  • Configuration stories are about enabling maintenance staff and administrators to keep the configuration of the BI system up to date without having to change the code.
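
To make the Data Validation category a little more concrete, here is a minimal sketch in Python of what such a story might produce: a set of named business rules applied to extracted rows, with failing rows set aside for follow-up. The rules, the field names (order_amount, country_code) and the quarantine approach are assumptions made for this example; they are not taken from the Heizenberg et al. article.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the row passes the rule


# Hypothetical business rules agreed with the business owner.
RULES = [
    Rule("order_amount is not negative", lambda row: row["order_amount"] >= 0),
    Rule("country_code is present", lambda row: bool(row.get("country_code"))),
]


def validate(rows, rules):
    """Split rows into accepted rows and (row, failed_rule_names) pairs."""
    accepted, quarantined = [], []
    for row in rows:
        failed = [rule.name for rule in rules if not rule.check(row)]
        if failed:
            quarantined.append((row, failed))  # the "necessary action" in this sketch
        else:
            accepted.append(row)
    return accepted, quarantined


# Hypothetical extracted rows.
rows = [
    {"order_id": 1, "order_amount": 120.0, "country_code": "NZ"},
    {"order_id": 2, "order_amount": -5.0, "country_code": "NZ"},
    {"order_id": 3, "order_amount": 40.0, "country_code": ""},
]

accepted, quarantined = validate(rows, RULES)
for row, failed in quarantined:
    print(row["order_id"], failed)
```

Giving each rule a name means the quarantine output can be shown to the business owner as is, which also makes it a convenient artefact for the sprint review.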

I would argue that Data Presentation, Data Validation and Configuration stories are similar to the user stories you have in software development. However, Data Disclosure and Data Augmentation user stories often raise many questions that the business owner will have trouble answering:

  • What source system does the data come from? What is the source of truth for the data?
  • Do we have access to the source system already? If not, how (and when) can we get it?
  • What fields and tables are needed? Are they in the DW already? If so, at what layer?
  • Is the data clean enough to use for the intended purpose?
  • In the case of data augmentation, how much do we know about the calculation?
  • Is a prototype needed to help the business owner with the definition?
  • What existing reports/dashboards/etc. are affected?

I’ve seen scrum teams handle the additional analysis in different ways:

  • Pre-Sprint Prep Team: One of my larger clients had a separate team of technical analysts that worked outside the scrum team. Their role was to analyse the high-priority user stories in the backlog to ensure the scrum team had enough information to size them. They assisted the PO in getting user stories ‘sprint ready’.
  • Spikes: A spike is a user story or task where the goal is to gather information rather than create shippable code. One of the scrum teams I worked on didn’t have enough analysts for a pre-sprint team, so we would take high-priority user stories that we didn’t have enough information to size and treat them as spikes. We would time-box the story, and the scrum team would investigate it within the sprint. At the end of the sprint, the story would go back to the backlog to be sized and prioritised, or another spike would be started if discovery was not complete.
  • Designated Discovery Time: A less desirable option is to carve out some time from each sprint for discovery. But this uses the scrum team for both sprint and pre-sprint work, which always impacts efficiency.