Introduction

After years of project failures in the IT ecosystem, agile development methods have become a standard that has helped reduce the gap between digital applications and their users. Complex applications have been developed for nearly 50 years now, and that (short) experience has helped us better understand digital projects. Workload estimates are now easy for classical features: we can anticipate precisely how much time it will take to develop something. The end user is now at the center of the development process: we can validate regularly (and thus over the long term) that the product matches the user’s needs and expectations. Tools have been developed to formalize and frame the whole process: we can use many models and practices (such as diagrams or code versioning) to support agile methods.

However, a sub-category of digital projects emerged about 10 years ago, once computing power reached a certain level. Data-science-based projects (DSBPs) are classical projects that aim to implement a company’s strategy by making data-driven decisions. For example, if the CMO of a company wants to optimize the ads placed in the underground, she may ask for information from the database such as “What is the average basket for people using an underground coupon?”. This example shows that the project itself may not be particularly digital (it aims to optimize physical ads for a physical shop) but needs data-science technologies to achieve its goal.
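To make the CMO's question concrete, here is a minimal sketch of how it could be answered. The records, the field names (`coupon`, `basket_total`) and the coupon code `UNDERGROUND` are all invented for illustration; a real project would query the company's own database.

```python
from statistics import mean

# Hypothetical order records: field names and values are made up for this sketch.
orders = [
    {"order_id": 1, "coupon": "UNDERGROUND", "basket_total": 40.0},
    {"order_id": 2, "coupon": None, "basket_total": 25.0},
    {"order_id": 3, "coupon": "UNDERGROUND", "basket_total": 60.0},
    {"order_id": 4, "coupon": "NEWSLETTER", "basket_total": 30.0},
]


def average_basket(orders, coupon):
    """Return the mean basket total of orders that used `coupon`, or None if none did."""
    totals = [o["basket_total"] for o in orders if o["coupon"] == coupon]
    return mean(totals) if totals else None


print(average_basket(orders, "UNDERGROUND"))  # 50.0
```

The computation itself is trivial. The point is that the question and the resulting decision (where to place physical ads) are business matters, while only the means of answering them is digital.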

Due to the lack of experience in this domain, many problems have arisen recently with the explosion of this type of project. Indeed, the data-science domain and its associated projects have so far been considered rather “experimental”. Companies that really rely on machine learning or advanced statistics (for example, for their business model or value proposition) are rare, and most data-science projects are launched with the idea of “let’s try to find something, but if we don’t, it’s ok”. Also, the use of computer science to obtain results has put those projects under the label of “digital projects”. Unfortunately, the many differences between conventional digital projects and DSBPs have led to several problems that make those projects unpredictable.

For now, DSBPs are exploratory projects. Unlike conventional IT development projects, results in data science are often obtained only after a process of exploration and discovery. When it comes to developing a website or deploying a network, workload estimates allow us to define clearly the time needed. However, when we want to optimize a neural network to classify users based on dozens of dimensions, choosing the hyper-parameters, the features and, more generally, the design of the algorithm is not a well-defined process. It proceeds by trial and error, and may never even reach a satisfactory level. There are two problems with this situation. First, it is really hard to quantify the workload. Second, because management rarely actually expected results from data science, no one ever really set and tracked deadlines. It turns out that those two consequences compensate for each other, but if the second one is missing (e.g. in the case of a serious project relying on results and predictability), the whole situation becomes unstable. Finally, those problems make it impossible for the user to be at the center of the project, validating small steps and making sure the algorithm really fits the problem.

Moreover, DSBPs are considered development projects. Since data scientists use computing tools, management has considered them able to provide the packaging around their data-science outputs. By packaging we mean developing the user interface on top of the underlying mathematics, but also putting the result in production, and maybe even managing the digital resources underneath all of that. This leads to really poor and user-unfriendly products, deployed with the wrong technologies, and neither scalable nor secure. Another consequence is that we try to use digital tools on data problems: we directly use the models from agile development (either technical diagrams or project management schemas) to track a data-science project, and we rely on the computer-science ecosystem to perform computations, express problems and deliver results.