Can we successfully implement Agile in data science?

Agile is about iterative development and delivering tangible products/features quickly, which provides the business with value and ROI faster than a traditional waterfall project.

Consider the example of a piece of accounting software. Overall, it’s going to have 50 features to support the accounts team. Delivering all of the features in a waterfall fashion will take 6 months, at which point the accounts team will finally be able to test the tool and may well find it doesn’t quite meet their needs.

Agile seeks to fix two problems with the above scenario. First, we don’t want the business to wait 6 months to start getting value from this software. Second, we want to give them an early view of each feature, to ensure it aligns with their expectations, rather than waiting until the 6-month project is complete.

How do we do that? Agile gives us the ability to iterate and continuously deliver. So in week 1, we may develop a tool to automate payroll. This is feature 1 of 50 – the business can start using and receiving value from it immediately, while development of the other 49 features continues.

Why doesn’t everyone use Agile for data science?

It’s less common for data science projects to follow the Agile methodology than it is for software projects. This is because data science projects require more investigation, data exploration, feasibility studies, testing and model tuning than a traditional software development project. Moreover, data science projects often go through iterations of failed tests and often consider multiple approaches to solving a problem. In other words, a data science project is non-linear – it doesn’t follow a prescribed format or approach.

Some thoughts on better managing these projects include timeboxing the experimental/research phases to avoid weeks or months with no tangible project progress. Daily stand-up meetings can keep the team informed of research progress and can be used to collect ideas from the team which may accelerate research. They may even flag that the problem is more complex than expected and that the cost of delivery is starting to outweigh the business value, at which point you may choose not to continue with the piece of work. In other words, we must know when to stop.

Daily stand-up meetings also support the alignment of data analysts, engineers and scientists. It’s often the case that the three roles are misaligned: they don’t know what one another are doing, which leads to delays during the project (e.g. the data engineer delivering features that no longer match the data scientist’s research). By joining the daily stand-up meetings and being an active part of the process, we keep all parties in the project aligned.

We need to accept failure in the world of data science. If we are going to fail, we need to ensure that we fail fast. When something isn’t working, a pragmatic view needs to be taken: do we continue, or do we change the approach? Again, daily stand-up calls can be used to identify issues quickly.

Finally, we need to accept that simple is not always bad. A constant struggle is the desire to work with the shiniest, newest technologies. While that’s understandable, it may not result in a more accurate model and can be a significant time-drain. It’s important to always start simple and work up to a more complex solution only if it’s required.

Choose The Right One For You: Kanban or Scrum?

In my opinion, Kanban works better than Scrum for agile data science.

The concept of Kanban is that we work from a prioritised backlog of tasks; by always selecting from the top, we ensure we are working on the most valuable business initiatives.

The concept of Scrum is that we assign a list of tasks to a sprint, and those tasks must be delivered within the sprint timebox (usually between 1 week and 1 month in length).

The benefits of Kanban are:

  1. Kanban is a very lightweight project management tool, making it ideal for teams with low process maturity
  2. Kanban is not timeboxed, so research problems that are hard to estimate don’t need to be de-scoped from sprint 1 and re-scoped into sprint 2, etc.
  3. Kanban lends itself to a super collaborative team working environment.
  4. The whole business can visually see progress, making the process more transparent
  5. Adding and removing items mid-sprint is disruptive in Scrum. This is not an issue in Kanban: we simply reprioritise the backlog
  6. The team is always delivering the most valuable pieces of work as they choose from the top of the priority list
  7. Ad-hoc issues are simply added to the top of the priority list, rather than changing sprint scope
  8. The team has clear focus – most important deliverables are prioritised for them
  9. We can achieve continuous deployment – as one story completes, it can be deployed

To execute a Kanban approach successfully, the team needs a similar skillset, so anyone can pick up any of the tasks, rather than having a bottleneck based on a key individual.

We must also limit the amount of work in progress. This retains focus on the most important tasks, leading to faster deployment and more value delivered.
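The Kanban mechanics described above – a prioritised backlog, pulling work from the top, a work-in-progress limit, and ad-hoc issues jumping the queue – can be sketched in a few lines of code. This is a minimal illustration only; the class, method names and tasks are all hypothetical, not from Jira or any Kanban tool.

```python
from collections import deque


class KanbanBoard:
    """Illustrative sketch of a Kanban flow: prioritised backlog,
    pull-from-the-top selection, and a WIP limit."""

    def __init__(self, wip_limit):
        self.backlog = deque()      # index 0 = highest priority
        self.in_progress = []
        self.done = []
        self.wip_limit = wip_limit

    def add_task(self, task, urgent=False):
        # Ad-hoc issues go straight to the top of the priority list;
        # everything else joins the bottom and is reprioritised as needed.
        if urgent:
            self.backlog.appendleft(task)
        else:
            self.backlog.append(task)

    def pull_next(self):
        # Respect the WIP limit: no new work until something finishes.
        if len(self.in_progress) >= self.wip_limit or not self.backlog:
            return None
        task = self.backlog.popleft()  # always the most valuable item
        self.in_progress.append(task)
        return task

    def complete(self, task):
        # As one story completes it can be deployed immediately –
        # continuous delivery, with no sprint boundary to wait for.
        self.in_progress.remove(task)
        self.done.append(task)


board = KanbanBoard(wip_limit=2)
board.add_task("Automate payroll")
board.add_task("Explore invoice data")
board.add_task("Fix broken pipeline", urgent=True)

print(board.pull_next())  # Fix broken pipeline (urgent item pulled first)
print(board.pull_next())  # Automate payroll
print(board.pull_next())  # None – WIP limit of 2 reached
```

Note there is no sprint object anywhere: an urgent item simply becomes the next thing pulled, and completing a story frees capacity for the next one, which is exactly the flexibility the list above describes.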

The project manager / technical lead must be pragmatic about choosing simple solutions to problems where possible, since developers will naturally lean towards the most technically interesting solutions, leading to slippage.

Daily stand-up meetings can be used to review progress. This gives us an opportunity to quickly identify those tasks that are not possible or less valuable to the business, without expending lots of effort to achieve little.

Atlassian, the creators of Jira, publish a comparison table of the two methodologies. As it shows, there is generally less structure around Kanban, which works well in data science projects where things are just a little bit more unknown. You avoid the issue of things rolling over to the next sprint and breaking your reporting too!

Kodey