Without going into a Marxist analysis here, the idea is simple: the tools we build to address a given need also end up shaping the way we frame problems, sometimes opening new possibilities, sometimes constraining them. This is particularly true in technical fields, and analytical infrastructures are no exception. The Data Warehouse was initially a straightforward response to a clear need: centralizing historical data to produce metrics and reports, which became known as Business Intelligence (BI). But as BI became a strategic asset, it also introduced rigid practices, well suited to traditional needs (financial ratios, dashboards) yet less adapted to the emerging requirements of data science (pattern detection, forecasting, advanced modeling, etc.), which rely on data of a different scope, granularity, and volume.
As a result, some innovative initiatives ran into the constraints of existing infrastructure, governance rules, or tools designed for traditional BI. Trying to “fit everything” into this perimeter often meant soaring costs, degraded overall performance, strained security policies, and ultimately lower governance standards. In some cases, data scientists (when they were not in practice hired as data analysts) found themselves isolated, working in local environments without collaboration or standardization, which significantly limited the impact of their work.
In the era of generative AI, things have not fundamentally changed. Data scientists have sometimes been relabeled as GenAI leads or integrators. But without a broader infrastructure vision, the same causes will continue to produce the same effects.
To better understand and overcome these dynamics, we propose a reflection on the levels of maturity of analytical infrastructure. The term “analytical” is intentionally broad, encompassing both traditional data analysis and data science.
The first challenge is to recognize that needs are diverse, and that successful data projects depend on the ability to address them in a balanced way. It is precisely in this balance — between consolidation, exploration, and modeling — that the true value of analytical infrastructure lies. Each phase will be introduced with a typical use case, followed by an analysis of its strengths and limitations, taking into account the different “points of view” of data users.
1. The invention of the Data Warehouse
While the concept of Business Intelligence dates back to the 1950s and 60s, notably with Hans Peter Luhn's 1958 article “A Business Intelligence System”, it was in the 1990s, with the emergence of Data Warehouses, that tools enabled its large-scale adoption. These centralized infrastructures transformed BI from a theoretical concept into a practical system capable of consolidating data and producing reliable analysis at scale, in a unidirectional flow.
1.a Typical use case
Sales or finance leadership wants to monitor performance. The IT team provides pipelines to synchronize source systems into a centralized database accessible for dashboards. Key information from various systems (ERP, CRM, etc.) becomes available without impacting operational systems. The flow is unidirectional: replication, most often batch-based.
Two types of users emerge:
- the “standard” dashboard user, able to consume reports and possibly adjust layouts,
- the “advanced” user, with enough business and technical knowledge to define and validate new metrics.
Recent BI paradigms emphasize “self-service”. When data is of high quality and well structured, every user becomes a potential advanced user, as the technical barrier to accessing and combining data decreases.
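The unidirectional batch flow of this use case can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes two in-memory SQLite databases standing in for an operational source and the warehouse, and the table names (`orders`, `fact_orders`), the columns, and the full-refresh strategy are all hypothetical choices made for the example.

```python
import sqlite3


def replicate_orders(source: sqlite3.Connection, warehouse: sqlite3.Connection) -> int:
    """Batch-copy all orders from the source into the warehouse (full refresh)."""
    # Read once from the operational system; nothing is ever written back to it.
    rows = source.execute("SELECT order_id, region, amount FROM orders").fetchall()
    with warehouse:  # commit on success, roll back on error
        warehouse.execute(
            "CREATE TABLE IF NOT EXISTS fact_orders "
            "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
        )
        warehouse.execute("DELETE FROM fact_orders")
        warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    return len(rows)


def revenue_by_region(warehouse: sqlite3.Connection) -> dict:
    """Dashboard-style metric computed only from warehouse data."""
    cursor = warehouse.execute(
        "SELECT region, SUM(amount) FROM fact_orders GROUP BY region ORDER BY region"
    )
    return dict(cursor.fetchall())


def demo() -> dict:
    # Hypothetical operational source (for example, an ERP order table).
    source = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    source.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "EMEA", 120.0), (2, "EMEA", 80.0), (3, "APAC", 50.0)],
    )
    source.commit()

    warehouse = sqlite3.connect(":memory:")
    replicate_orders(source, warehouse)
    return revenue_by_region(warehouse)


if __name__ == "__main__":
    print(demo())
```

Dashboards query only the warehouse copy, so reporting load never reaches the operational system. In return, any new field must first be added to the replication step before users can see it.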
1.b Strengths
The infrastructure is robust and easy to monitor (simple orchestration). The IT team retains full control and can be engaged to make new data available.
- Data quality is ensured through clear governance
- Information is validated and distribution is controlled
1.c Limitations
- Integrating new data is costly: pipelines must be updated
- Self-service is limited to data already present in the warehouse
- Scalability raises questions: what about access to other data domains?
- What about granularity (aggregated data limiting deeper analysis)?
- What about real-time needs in a batch-oriented system?
2. The emergence of Data Lakes
The growing volume of data and associated storage challenges led to separating storage from compute. In 2010, James Dixon, CTO of Pentaho, coined the term “Data Lake” to describe centralized storage for all types of data — including unstructured — making them available for analysis. Hadoop (2005) helped operationalize this vision.
The arrival of Amazon S3, a low-cost and easily scalable cloud object storage service, marked a major milestone in the development of data lakes.
2.a Typical use case
An industrial department needs ad hoc analysis for process mining, requiring access to machine log data not available in the Data Warehouse. Since the required data is not fully known in advance, and direct access to production systems would be risky, a Data Lake is introduced as an exploration zone. Analysis is performed using data science tools (typically Python). Visualizations are mainly exploratory. Once a metric becomes valuable, it can be reintegrated into BI pipelines.
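As an illustration of this kind of exploratory work, the sketch below computes a candidate metric from raw machine logs. It assumes, purely for the example, that logs land in the lake as JSON lines with hypothetical `machine` and `duration_s` fields; real machine logs will look different, and malformed lines are to be expected in a raw zone.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical raw log lines, as they might be read from a file in the lake.
RAW_LOG_LINES = [
    '{"machine": "press-01", "step": "stamp", "duration_s": 4.0}',
    '{"machine": "press-01", "step": "stamp", "duration_s": 6.0}',
    '{"machine": "press-02", "step": "stamp", "duration_s": 9.0}',
    "corrupted line",
]


def mean_duration_by_machine(lines):
    """Return (mean duration per machine, number of lines skipped)."""
    durations = defaultdict(list)
    skipped = 0
    for line in lines:
        try:
            event = json.loads(line)
            machine = event["machine"]
            duration = float(event["duration_s"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # Raw data is not validated upstream: count the bad line and move on.
            skipped += 1
            continue
        durations[machine].append(duration)
    return {m: mean(values) for m, values in durations.items()}, skipped


if __name__ == "__main__":
    metric, skipped = mean_duration_by_machine(RAW_LOG_LINES)
    print(metric, "skipped:", skipped)
```

If a metric like this proves useful, its definition can then be moved into the BI pipelines, where it is computed and validated under the usual governance.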
2.b Strengths
- Exploration is controlled: no shadow IT or uncontrolled exports
- Innovation is enabled through test-and-learn
- Advanced users can share code within structured workflows
2.c Limitations
- Requires strong governance (security, resource management)
- Risk of “data swamp” if poorly structured
- Query performance can be slow without indexing
3. Integration and convergence: the Data Fabric
As data sources and use cases multiply, governance becomes critical. Over the past decade, the concept of Data Fabric has emerged as an architectural approach to unify and automate access, integration, and governance across environments (cloud, SaaS, on-prem) and use cases (BI, analytics, ML, operational systems).
3.a Typical use case
Finance wants to reintegrate computed data (scores, lists) into operational systems. The Data Lake becomes a central hub feeding both the Data Warehouse and operational systems. To support this:
- data is structured into layers (raw, curated, ready-to-use)
- pipelines are industrialized (versioning, orchestration, testing)
Data Fabric ensures continuity, governance, and interoperability.
3.b Strengths
- Unifies all data use cases
- Reduces shadow IT
- Enables faster transition from experimentation to production
3.c Limitations
- Conceptually attractive but complex to implement
- Requires assembling many components
- Risk of dependency on hyperscaler solutions and rising costs
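The layering mentioned in the use case (raw, curated, ready-to-use) can be made concrete with a small sketch. Everything here is an assumption for illustration: the record fields (`customer_id`, `score`), the validation rules, and the 0.8 threshold are hypothetical, and a real pipeline would run under an orchestrator with versioned, tested steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CuratedScore:
    customer_id: str
    score: float


# Hypothetical raw layer: records kept exactly as they were produced.
RAW_RECORDS = [
    {"customer_id": "C1", "score": "0.91"},
    {"customer_id": "C2", "score": 0.4},
    {"customer_id": "", "score": 0.95},     # missing identifier
    {"customer_id": "C3", "score": "n/a"},  # unusable score
    {"customer_id": "C1", "score": 0.85},   # later value for C1
    {"customer_id": "C4", "score": 0.8},
]


def to_curated(raw_records):
    """Raw -> curated: enforce types, drop invalid records, keep the last value per customer."""
    latest = {}
    for record in raw_records:
        customer_id = str(record.get("customer_id") or "").strip()
        try:
            score = float(record.get("score"))
        except (TypeError, ValueError):
            continue
        if not customer_id or not 0.0 <= score <= 1.0:
            continue
        latest[customer_id] = CuratedScore(customer_id, score)
    return list(latest.values())


def to_ready(curated, threshold=0.8):
    """Curated -> ready-to-use: the customer list to push back to an operational system."""
    return sorted(item.customer_id for item in curated if item.score >= threshold)


if __name__ == "__main__":
    print(to_ready(to_curated(RAW_RECORDS)))
```

Keeping each transition as a separate, side-effect-free function makes it straightforward to test and version every layer independently, which is part of what industrializing the pipelines means in practice.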
Perspectives
This evolution is not linear, nor should organizations converge toward a single monolithic model. Alternative approaches have emerged, often in reaction. The “Data Mesh” is frequently presented as a decentralized alternative to Data Fabric. In practice, it is primarily an organizational model, relying on similar tools but structured around domains rather than centralization. Cloud-native approaches have also evolved, such as the “Modern Data Stack”.
Finally, powerful lightweight tools now enable tailored solutions. For example, combining DuckDB with S3 and a Python interface like Streamlit can provide a cost-effective and agile alternative for many small to medium-scale use cases, though scalability and governance considerations remain important.
As Georges Perec (allegedly) said: “To think is to classify.” In the era of generative AI, structuring becomes even more critical. Two complementary dimensions must be considered:
- Technical structuring: having data without being able to query it efficiently (catalog, performance, etc.) is often worse than not having it at all. The Lakehouse paradigm represents a major step forward.
- Organizational structuring: giving business meaning to data. The semantic layer is a key differentiator for organizations successfully adopting generative AI.
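To make the idea of a semantic layer more tangible, here is a deliberately simplified sketch: business metrics are declared once, with a description and an expression, and queries are generated from those declarations instead of being rewritten by each user or tool. The metric names, the `invoices` table, and the dictionary format are hypothetical; dedicated semantic-layer products are far richer than this.

```python
import sqlite3

# Hypothetical semantic layer: one shared definition per business metric.
SEMANTIC_LAYER = {
    "net_revenue": {
        "description": "Invoiced amount minus refunds",
        "expression": "SUM(amount - refund)",
        "table": "invoices",
    },
    "invoice_count": {
        "description": "Number of invoices issued",
        "expression": "COUNT(*)",
        "table": "invoices",
    },
}
ALLOWED_DIMENSIONS = {"region"}


def compile_metric(metric: str, dimension: str) -> str:
    """Turn a (metric, dimension) request into SQL using the shared definitions."""
    if metric not in SEMANTIC_LAYER:
        raise KeyError(f"unknown metric: {metric}")
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    definition = SEMANTIC_LAYER[metric]
    return (
        f"SELECT {dimension}, {definition['expression']} AS {metric} "
        f"FROM {definition['table']} GROUP BY {dimension} ORDER BY {dimension}"
    )


def run_metric(conn: sqlite3.Connection, metric: str, dimension: str):
    return conn.execute(compile_metric(metric, dimension)).fetchall()


def demo_connection() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (region TEXT, amount REAL, refund REAL)")
    conn.executemany(
        "INSERT INTO invoices VALUES (?, ?, ?)",
        [("EMEA", 100.0, 10.0), ("EMEA", 50.0, 0.0), ("APAC", 70.0, 20.0)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    print(run_metric(demo_connection(), "net_revenue", "region"))
```

Because only declared metrics and dimensions can be requested, a dashboard, a notebook, or a generative AI assistant asking for "net revenue by region" all resolve to the same definition.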
Note: This article was translated and refined with the help of AI tools. The ideas and opinions expressed here are my own.