<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Agile Data N’ Info: Agile Data N' Info Podcasts]]></title><description><![CDATA[Agile Data N' Info Podcast Episodes]]></description><link>https://agiledata.info/s/agiledata-podcasts</link><image><url>https://substackcdn.com/image/fetch/$s_!ErtR!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8892c64-a0c7-4c7b-9f49-a73be5280f22_1280x1280.png</url><title>Agile Data N’ Info: Agile Data N&apos; Info Podcasts</title><link>https://agiledata.info/s/agiledata-podcasts</link></image><generator>Substack</generator><lastBuildDate>Wed, 08 Apr 2026 16:59:29 GMT</lastBuildDate><atom:link href="https://agiledata.info/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Agile Data Limited]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[DataNInfo@agiledataguides.com]]></webMaster><itunes:owner><itunes:email><![CDATA[DataNInfo@agiledataguides.com]]></itunes:email><itunes:name><![CDATA[Shagility]]></itunes:name></itunes:owner><itunes:author><![CDATA[Shagility]]></itunes:author><googleplay:owner><![CDATA[DataNInfo@agiledataguides.com]]></googleplay:owner><googleplay:email><![CDATA[DataNInfo@agiledataguides.com]]></googleplay:email><googleplay:author><![CDATA[Shagility]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Fact-based modelling patterns with Marco Wobben]]></title><description><![CDATA[AgileData Podcast #81]]></description><link>https://agiledata.info/p/fact-based-modelling-patterns-with</link><guid isPermaLink="false">https://agiledata.info/p/fact-based-modelling-patterns-with</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Sat, 14 Feb 2026 10:46:07 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6954060b-988e-4da4-9ab9-379b975be344_800x800.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Marco Wobben about the patterns within Fact-based modeling.</p><blockquote><p><strong><a href="https://agiledata.substack.com/i/187938526/listen">Listen</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/187938526/google-notebooklm-mindmap">View MindMap</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/187938526/google-notebooklm-briefing">Read AI Summary</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/187938526/transcript">Read Transcript</a></strong></p></blockquote><p></p><h2>Listen</h2><p>Listen on all good podcast hosts or over at:</p><p><a href="https://podcast.agiledata.io/e/fact-based-modelling-patterns-with-marco-wobben-episode-81/">https://podcast.agiledata.io/e/fact-based-modelling-patterns-with-marco-wobben-episode-81/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://podcast.agiledata.io/e/fact-based-modelling-patterns-with-marco-wobben-episode-81/&quot;,&quot;text&quot;:&quot;Listen to the Podcast Episode on Podbean&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" 
href="https://podcast.agiledata.io/e/fact-based-modelling-patterns-with-marco-wobben-episode-81/"><span>Listen to the Podcast Episode on Podbean</span></a></p><p></p><p></p><blockquote><p><strong>Subscribe:</strong> <a href="https://podcasts.apple.com/nz/podcast/agiledata/id1456820781">Apple Podcast</a> | <a href="https://open.spotify.com/show/4wiQWj055HchKMxmYSKRIj">Spotify</a> | <a href="https://www.google.com/podcasts?feed=aHR0cHM6Ly9wb2RjYXN0LmFnaWxlZGF0YS5pby9mZWVkLnhtbA%3D%3D">Google Podcast </a>| <a href="https://music.amazon.com/podcasts/add0fc3f-ee5c-4227-bd28-35144d1bd9a6">Amazon Audible</a> | <a href="https://tunein.com/podcasts/Technology-Podcasts/AgileBI-p1214546/">TuneIn</a> | <a href="https://iheart.com/podcast/96630976">iHeartRadio</a> | <a href="https://player.fm/series/3347067">PlayerFM</a> | <a href="https://www.listennotes.com/podcasts/agiledata-agiledata-8ADKjli_fGx/">Listen Notes</a> | <a href="https://www.podchaser.com/podcasts/agiledata-822089">Podchaser</a> | <a href="https://www.deezer.com/en/show/5294327">Deezer</a> | <a href="https://podcastaddict.com/podcast/agiledata/4554760">Podcast Addict</a> |</p></blockquote><div id="youtube2-A8NNwWWsMro" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;A8NNwWWsMro&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/A8NNwWWsMro?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>You can get in touch with Marco via <a href="https://www.linkedin.com/in/wobben/">LinkedIn</a> or over at <a href="https://casetalk.com">https://casetalk.com</a></p><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? 
The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><h2>Google NotebookLM Mindmap </h2><p></p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SSeF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06da155-bf0f-4e7c-99ce-adf7fd87e932_4012x5337.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SSeF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06da155-bf0f-4e7c-99ce-adf7fd87e932_4012x5337.png 424w, https://substackcdn.com/image/fetch/$s_!SSeF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06da155-bf0f-4e7c-99ce-adf7fd87e932_4012x5337.png 848w, https://substackcdn.com/image/fetch/$s_!SSeF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06da155-bf0f-4e7c-99ce-adf7fd87e932_4012x5337.png 1272w, https://substackcdn.com/image/fetch/$s_!SSeF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06da155-bf0f-4e7c-99ce-adf7fd87e932_4012x5337.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SSeF!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06da155-bf0f-4e7c-99ce-adf7fd87e932_4012x5337.png" width="1200" height="1596.4285714285713" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c06da155-bf0f-4e7c-99ce-adf7fd87e932_4012x5337.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1937,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:1079335,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/187938526?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06da155-bf0f-4e7c-99ce-adf7fd87e932_4012x5337.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SSeF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06da155-bf0f-4e7c-99ce-adf7fd87e932_4012x5337.png 424w, https://substackcdn.com/image/fetch/$s_!SSeF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06da155-bf0f-4e7c-99ce-adf7fd87e932_4012x5337.png 848w, 
https://substackcdn.com/image/fetch/$s_!SSeF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06da155-bf0f-4e7c-99ce-adf7fd87e932_4012x5337.png 1272w, https://substackcdn.com/image/fetch/$s_!SSeF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06da155-bf0f-4e7c-99ce-adf7fd87e932_4012x5337.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h2>Google NotebookLM Briefing</h2><h2><strong>Executive Summary</strong></h2><p>This document synthesizes key insights from a discussion between Shane Gibson and Marco Wobben regarding <strong>Fact-Based Modeling</strong>&#8212;also known as Fact-Oriented Modeling. The central premise is that modern data modeling has become a &#8220;lost art,&#8221; leading to significant &#8220;business debt&#8221; where organizations lose the context and meaning behind their data due to silos and rapid staff turnover.</p><p>The core solution presented is Fact-Based Modeling, a methodology that grounds abstract technical requirements in &#8220;administrative reality&#8221; by combining linguistic terms with actual data examples. By focusing on how stakeholders communicate (Information Modeling) rather than just how systems store data (Data Modeling), Fact-Based Modeling allows organizations to bridge the gap between business subject matter experts (SMEs) and technical implementations. 
This approach not only ensures more accurate system design but also provides the necessary semantic grounding for emerging technologies like Large Language Models (LLMs).</p><p>--------------------------------------------------------------------------------</p><h3><strong>The Crisis of Lost Context: Technical and Business Debt</strong></h3><p>The current state of data management is characterized by a widening gap between what is stored in systems and what those records mean to the business.</p><ul><li><p><strong>Evaporation of Knowledge:</strong> Senior experts with decades of organizational history are retiring or leaving, and the average job tenure (four to six years) is too short to maintain deep context.</p></li><li><p><strong>Business Debt:</strong> This is the cumulative loss of meaning within an organization. When systems are built without documenting the &#8220;story&#8221; behind the data, the original business intent is lost, leaving IT to guess the context of legacy records.</p></li><li><p><strong>The Context Gap:</strong> Technical optimization (how data is stored) often overrides business representation (how data is used). This leads to &#8220;tribal wars&#8221; where different departments use the same terms (e.g., &#8220;inventory&#8221; or &#8220;customer&#8221;) to mean entirely different things based on their specific departmental needs.</p></li></ul><p>--------------------------------------------------------------------------------</p><h3><strong>Defining Fact-Based Modeling </strong></h3><p>Fact-Based Modeling is a methodology developed to capture domain knowledge by focusing on &#8220;facts&#8221;&#8212;statements about the business that are agreed upon as true within a specific context.</p><p><strong>Core Components of the Fact-Based Modeling Approach</strong></p><ul><li><p><strong>Grounding in Examples:</strong> Unlike traditional modeling that looks at abstract entities and attributes, Fact-Based Modeling uses &#8220;data by example.&#8221; Instead of discussing a &#8220;Customer&#8221; entity, a modeler uses a statement like: <em>&#8220;Customer 123 buys Product XYZ.&#8221;</em></p></li><li><p><strong>Binding Term and Fact:</strong> By combining the linguistic term with a concrete fact, modelers can identify misalignments quickly. For instance, seeing that one system identifies a customer by an email and another by a numeric ID reveals a transformation problem that abstract modeling might miss.</p></li><li><p><strong>Information Modeling vs. 
Data Modeling:</strong></p><ul><li><p> <strong>Information Modeling:</strong> Focuses on how humans communicate about data to reach alignment.</p></li><li><p> <strong>Data Modeling:</strong> Focuses on technical storage, structures, and optimization.</p></li><li><p> <strong>The Distinction:</strong> Information modeling is the &#8220;primary citizen,&#8221; while technical artifacts (SQL, schemas) are secondary outputs derived from it.</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WSQf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57420bb3-698b-43ec-a661-0de5e37e561d_486x215.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WSQf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57420bb3-698b-43ec-a661-0de5e37e561d_486x215.png 424w, https://substackcdn.com/image/fetch/$s_!WSQf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57420bb3-698b-43ec-a661-0de5e37e561d_486x215.png 848w, https://substackcdn.com/image/fetch/$s_!WSQf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57420bb3-698b-43ec-a661-0de5e37e561d_486x215.png 1272w, https://substackcdn.com/image/fetch/$s_!WSQf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57420bb3-698b-43ec-a661-0de5e37e561d_486x215.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WSQf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57420bb3-698b-43ec-a661-0de5e37e561d_486x215.png" width="676" height="299.05349794238685" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57420bb3-698b-43ec-a661-0de5e37e561d_486x215.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:215,&quot;width&quot;:486,&quot;resizeWidth&quot;:676,&quot;bytes&quot;:27542,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/187938526?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57420bb3-698b-43ec-a661-0de5e37e561d_486x215.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!WSQf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57420bb3-698b-43ec-a661-0de5e37e561d_486x215.png 424w, https://substackcdn.com/image/fetch/$s_!WSQf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57420bb3-698b-43ec-a661-0de5e37e561d_486x215.png 848w, 
https://substackcdn.com/image/fetch/$s_!WSQf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57420bb3-698b-43ec-a661-0de5e37e561d_486x215.png 1272w, https://substackcdn.com/image/fetch/$s_!WSQf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57420bb3-698b-43ec-a661-0de5e37e561d_486x215.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>--------------------------------------------------------------------------------</p><h3><strong>Methodology: The Process of Fact-Based Modeling</strong></h3><p>Fact-Based Modeling follows a specific logical flow to ensure that complexity is simplified without losing essential nuances.</p><ol><li><p><strong>Scope the Domain:</strong> Identify the specific problem area (e.g., Sales, Emergency Room, Tax).</p></li><li><p><strong>Gather Data Stories:</strong> Interview SMEs to collect verbalizations of how they describe their work.</p></li><li><p><strong>Identify Business Constraints:</strong> Use interactive questioning to find the &#8220;rules&#8221; of the data. (e.g., &#8220;Can a citizen be registered in more than one municipality at once?&#8221;)</p></li><li><p><strong>Identify Exceptions:</strong> Use the data examples to flush out the &#8220;edge cases&#8221; that SMEs often forget until they see a specific record.</p></li><li><p><strong>Alignment through Generalized Objects:</strong> When different departments use different identifiers for the same concept (e.g., Name vs. Email), Fact-Based Modeling uses &#8220;generalized object types&#8221; to link these different views into a unified communication framework.</p></li></ol><p>--------------------------------------------------------------------------------</p><h3><strong>Strategic Value and Modern Application</strong></h3><p><strong>1. Automation and Efficiency</strong></p><p>Fact-Based Modeling allows for a &#8220;context-first&#8221; implementation. Because the model is rich in semantics and constraints, tools can automatically generate:</p><ul><li><p>SQL for database creation.</p></li><li><p>Data Vault or normalized models.</p></li><li><p>Database views that represent the original user stories.</p></li><li><p>Test data derived directly from the interviews.</p></li></ul><p><strong>2. Grounding AI and LLMs</strong></p><p>LLMs are proficient at generating &#8220;fabricated stories&#8221; but lack organizational context. Fact-Based Modeling provides the &#8220;ground truth&#8221; needed to keep AI outputs accurate. By feeding an LLM the terms, definitions, facts, and business constraints from a fact-based model, the AI can perform tasks with a much higher degree of reliability.</p><p><strong>3. Avoiding the &#8220;Generic Model&#8221; Trap</strong></p><p>The document highlights the failure of massive, pre-built industry models (e.g., the IBM Banking Model). These models often fail because organizations do not know their own &#8220;edge&#8221; or specific context. Fact-Based Modeling allows a company to capture its unique business logic rather than trying to fit into a generic template that ignores their specific reality.</p><p>--------------------------------------------------------------------------------</p><h3><strong>Notable Insights and Quotes</strong></h3><p><strong>On Complexity and Simplicity:</strong> &#8220;If the end product is presented and everybody goes: &#8216;Wow, is this it? 
Did it really take you that long... I could have done this,&#8217; then I succeeded in making something very complicated very simple to understand.&#8221; &#8212; <em>Marco Wobben.</em></p><p><strong>On the collaborative nature of solving data ambiguity: </strong>&#8220;It is somehow a team effort to slay this beast of miscommunication until everybody agrees and understands each other. ... It&#8217;s all about, working together and trying to fight it. What are we not seeing? What are we missing? How do we tackle this? And a lot of that is just human interaction.&#8221;  &#8212; <em>Marco Wobben.</em></p><p><strong>On the Definition of a Fact:</strong> &#8220;A fact is a piece of data that&#8217;s physically represented somewhere... I can point to it. It has been created. I&#8217;m not inferring it. It is something that is factually there.&#8221; &#8212; <em>Shane Gibson.</em></p><p><strong>On Party Entity Data Models:</strong> &#8220;The most expensive part of our systems is the humans and [understanding] that context... as soon as you design a system with &#8216;thing as a thing&#8217; and that context lives nowhere else, I now have to spend a massive amount of expensive time trying to understand what the hell [it is].&#8221; &#8212; <em>Shane Gibson.</em></p><p><strong>On the concept of &#8220;business debt&#8221; created by rapid technological change: </strong>&#8220;This is the paradox where business wants to have changed faster. And it ruined the party by saying, we can deliver faster with this new latest tech, but neither party realized what they were losing along the way. So it&#8217;s technical debt, it&#8217;s business debt.&#8221; &#8212; <em>Marco Wobben.</em></p><p>--------------------------------------------------------------------------------</p><p><strong>Conclusion</strong></p><p>Fact-Based Modeling serves as a bridge between the human understanding of business processes and the technical requirements of data storage. By prioritizing the &#8220;authentic story&#8221; of the business and grounding it in real-world data examples, organizations can mitigate the risks of technical and business debt, ensuring that their data remains a usable, understood asset even as technology and personnel change.</p><p></p><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><p></p><h2>Transcript</h2><p><strong>Shane</strong>: Welcome to the Agile Data Podcast. I&#8217;m Shane Gibson.</p><p><strong>Marco</strong>: And I&#8217;m Marco Wobben.</p><p><strong>Shane</strong>: Hey, Marco. Thank you for coming on the show. Today we&#8217;re gonna talk about a thing called fact-based modeling, but before we do that, why don&#8217;t you give the audience a bit of background about yourself?</p><p><strong>Marco</strong>: Ah, yes. Thank you. Thanks for having me. Yeah, background a lot of background there. I&#8217;ve been around a few decades. I first fell in love with computers when I was still in secondary school. 
That got me hooked when somebody showed me the break key on a keyboard and we could stop games mid play, change the code and resume.</p><p>And that was like, oh, this is magic. I need to figure out how to make this my toolbox. And I enjoyed making software, hacking software working with computers ever since. And. From getting a job onboarding people on Microsoft Windows and Office. Back in the day, I decided to just chase my own career, quit the job and started doing entrepreneur work, made custom software from design to end product for a number of startups. And then somewhere early two thousands, a professor knocked on our door and said, we have this source code of a modeling tool. And the kids that graduated on it, with, they took a different path and we need somebody to maintain it. And ever since early 2000, I&#8217;ve been working on fact-based modeling that I had to learn from the inside out.</p><p>So that&#8217;s where I am now. I&#8217;m being considered the expert currently. &#8216;Cause a lot of professors retired and the young people haven&#8217;t caught on for it yet, just and here I am talking about fact-based modeling or fact oriented modeling, if you will. </p><p><strong>Shane</strong>: Yeah, it&#8217;s interesting. We before we started, we talked briefly about the fact that data modeling&#8217;s become a lost art. And actually, I think it&#8217;s coming back now with all the AI focus, it seems that data modeling is a term I&#8217;m seeing used a lot. But, if you think about your career path, that idea that you could start out with games as a way of introducing yourself to, to computers in the early days.</p><p>I was like, you I started out. We had a couple of computers at school. The old I think they went green. I think they were amber screens back then. And yeah, again, I got hooked by the games. And I wondered whether that&#8217;s the thing that&#8217;s been missing is gamifying data modeling.</p><p>Like actually making it exciting is, it&#8217;s probably the missing thing, right? Is if there was a game that you just happened to date, a model when you did it. Maybe that&#8217;s what we needed to make what we do sticky with with people that are coming up in their career.</p><p><strong>Marco</strong>: that&#8217;s an interesting take on it. it&#8217;s interesting looking at my users, I&#8217;ve been maintaining this information modeling tool for years now, is that there seems to be more. Interest from people that are curious and like doing things right and talk about things. And this is, I can see that in, in a lot of gaming communities where, you know, in the old days it was like you play your game single player with this computer, right?</p><p>But gamification nowadays is so much more you can team up and you none of that stuff was there in the early years. Even graphics were not there yet. The communication aspects right nowadays is more and more prevalent and important and it&#8217;s. If you look back, what data modeling really is to get the technical requirements of what people actually needed the computer to do in store and how to manage it.</p><p>Seeing that come back a little I dunno if gamification would help, but there&#8217;s definitely more openness to let&#8217;s talk about it. And you can see it a little bit in the agile phenomena where there&#8217;s a lot of standups stakeholders, product owners, and everybody starts talking and communicating with each other.</p><p>So that&#8217;s definitely on the rise. 
I&#8217;ve seen a few data modeling efforts that actually try to gamify it and it&#8217;s like a, they have this data modeling tool and it has flashing text and animated tables and, I&#8217;m not sure if that&#8217;s really the end goal, but I can see how it might help. </p><p><strong>Shane</strong>: Yeah. I was thinking more about gamification as in the adrenaline rush when you are successful in solving a problem, not the flashing things. Because earlier in my career, I worked for a software startup in the accounting space called X. One of the reasons that they became successful was they had this gamification of bank reconciliation.</p><p>So pretty much you had bank transactions on the left and your invoices and all that on the right. And whenever a transaction came up on your bank statement, you pretty much dragged it and matched it to the right. And then that row disappeared and it was a very simple gamification process.</p><p>But there&#8217;s this adrenaline rush going in and seeing your bank rec with 50 transactions you haven&#8217;t reconciled, and just going bing, bang, bosh, and it becomes clear. And, I don&#8217;t, it didn&#8217;t have confetti. Actually that was one of the big arguments at the time: should it have confetti? </p><p><strong>Marco</strong>: More like a Tetris row disappearing from your screen.</p><p><strong>Shane</strong>: Yeah. Yeah, exactly. And so you sit there going, maybe that&#8217;s it. So as I think about it more, when I&#8217;m conceptually modeling, the adrenaline rush is when I create a map that I can show to a stakeholder and they just nod and go, yeah, that&#8217;s business reality. I get it. Yeah. That&#8217;s how we work.</p><p>When I logically model, it&#8217;s this idea of yeah, I can take that conceptual model and I can slam it into the modeling pattern that I use and it makes sense. And then when I take that logical model and I make a physical model and the cloud database actually takes it and I can actually query it fast and it doesn&#8217;t cost me a fortune, and any question that I get asked can be answered with that data.</p><p>There&#8217;s that adrenaline rush, right? That gamification of each of those steps adding value to my life or somebody else&#8217;s. I dunno, I just, I haven&#8217;t thought about it that way until you mentioned that you started out hacking games.</p><p><strong>Marco</strong>: That is, it definitely is true for me because, as I add features that support user functionality and all that, and it&#8217;s like there&#8217;s definitely a rush when I see people use it and they go oh, this is handy. This is practical. This makes my work easier. Then I get the adrenaline rush. Definitely on the modeling part itself. It usually goes through long cycles of deep talks about the subject matter at hand, which, there&#8217;s definitely an adrenaline rush, but not necessarily always the good ones. As a, as an example that I used a lot and some other authors put it in their book too, is I had a long interview with the subject matter expert and it took us about two hours to figure out one requirement, and there was a lot of talk on the type level that created confusion because, yeah, what is a customer really? So I had to get through that. And in the end there was a little bit of euphoria, which is the adrenaline rush, if you will, where we finally nailed it. The subject matter expert replied, somewhat baffled and in disbelief, and he says, I had no idea my work was that complicated. 
And then being able to write it down in a way that everybody suddenly understands. That&#8217;s definitely a moment of adrenaline rush, if you will. It takes two hours of hard work. It takes a lot of interviews, a lot of digging. And then finally when you reach that point, and it&#8217;s like one of, I remember a quote in a book, I forgot which one it was, but the quote really spoke to me, and this man was speaking about information modeling on a data modeling level.</p><p>And he said if the end product is presented and everybody goes, wow, is this it? Did it really take you that long and work so hard to just present this? I could have done this, and then his reply was, then I succeeded in making something very complicated very simple to understand.</p><p><strong>Shane</strong>: And that&#8217;s the key is to take that complexity and describe it in a simple way where everybody nods and either agrees or disagrees. I remember one where we literally spent three months with an organization trying to get the definition of active subscription. And the problem was we had three business units.</p><p>Effectively three domains that all had different definitions, but couldn&#8217;t agree. They actually had different definitions. It wasn&#8217;t around the model itself, it was around the plain language description of that term, where either we added a term in front of it, marketing active subscription, finance active subscription.</p><p>So we were clear they were two different definitions, or they actually agreed active subscription was described in this way. And that human engagement was where the time was spent, not creating a map with nodes and links on it, not creating the database.</p><p>Without actually getting to a stage where we ever agreed or disagreed with that term, there was minimal value carrying on, because we would present a number that people don&#8217;t agree with because the definition&#8217;s wrong, not the number. </p><p><strong>Marco</strong>: In the years of information modeling that I, I call it information modeling, but it&#8217;s really just fact-based modeling underneath as well. But I&#8217;ve encountered so many synonyms or homonyms, or it&#8217;s just, people just completely get confused and it becomes almost like a battle.</p><p>But here&#8217;s the thing, and I think you explained that as well. It is somehow a team effort to slay this beast of miscommunication until everybody agrees and understands each other. I just recently started watching the latest series of Stranger Things, where it&#8217;s like we have to team up to fight this monster from the underworld, which is, we can&#8217;t see it. We know it&#8217;s there. And it&#8217;s like, how do you fight it? And it&#8217;s all about working together and trying to figure it out. What are we not seeing? What are we missing? How do we tackle this? And a lot of that is just human interaction.</p><p><strong>Shane</strong>: And finding ways of taking that expertise that we have, that ability to take complexity in a business organization and try and create a map that has simplicity that we can all share. That is a skill. And it&#8217;s how do we take other people on the journey without going into a room for six months on our own or creating really complex ERD diagrams with many to many crow&#8217;s feet that few people understand. And terminology is really important. And so it&#8217;s interesting that you talk about information modeling and then you also talk about fact-based modeling. 
&#8216;cause as soon as I heard of fact-based modeling, I naturally go to dimensional modeling and star schemas because that is where I first heard a definition of the term fact.</p><p>And my understanding is you are talking about information modeling rather than anything to do with dimensional or star schemas. Is that correct?</p><p><strong>Marco</strong>: Yes. That&#8217;s funny &#8216;cause your first take on the word fact is how I met my wife at a data modeling conference in Portland in the USA. She was like, oh, fact-based modeling, I&#8217;m doing something with data warehousing. I should go to that class to listen to what this man has to say.</p><p>And it was just not the same fact. So even there, even in IT, we have confusion of words. But the word fact really boils down to something that is maybe a little older even, where database records really store facts as they happen in our administrative reality, as I call it nowadays.</p><p>Because there&#8217;s a lot of the single point of truth. We need to get the truth out there, the reality and all of those things. But there&#8217;s something seriously flawed with that, is that we all perceive from our own bias and subjective reality, the world out there. So there is no such thing as truth. But when you store data and you consider that to be true in your world, then you can state that as a fact. A fact as in, I&#8217;m writing this down and me and my colleagues agree on it. And so that&#8217;s where the terminology came from. Nowadays we&#8217;re trying to figure out, maybe we should call those claims instead, because we all say something and we all think it&#8217;s true.</p><p>And sometimes, especially in data warehousing, you collect data from different source systems. Yeah, but you can&#8217;t just say that they all state facts because some might be alternate facts. So let&#8217;s put it as all the source systems claim a certain statement about this is what happened. </p><p><strong>Shane</strong>: It&#8217;s interesting, &#8216;cause we were talking beforehand about how we&#8217;ve both been in the domain for quite a while, but we&#8217;ve never really crossed paths. And I&#8217;ve heard your name a lot, but I&#8217;ve never really read a lot of your stuff. And one of the things I did do, I had to train a new team moving to the data space.</p><p>And so I was trying to describe the difference between facts, measures, and metrics. And what I ended up coming up with is the language definitions that I used, and the definition I used was a fact is a piece of data that&#8217;s physically represented somewhere. So if I go into a database and I see a number.</p><p>That is a fact. If I go into a spreadsheet and I see a number or a piece of text, that is a fact because I can prove it existed. I can say that factually it&#8217;s there. I can point to it. It has been created. I&#8217;m not inferring it. It is something that is factually there for me. And that&#8217;s kinda why I use fact.</p><p>And the reason I raised information modeling versus dimensional modeling is &#8216;cause as soon as I use that term fact, anybody in the data domain goes, oh, you&#8217;re talking about a fact table. And I&#8217;m like, no. And then for me, I defined measure as a formula, some of this, that kind of thing.</p><p>And then a metric is a complex formula. So this over this based on that at this point in time. And so for me, I didn&#8217;t mind whether people disagreed with my definitions as long as they gave me a different definition. 
But that&#8217;s the three terms that I used that seemed to get clarity and understanding when I was talking to people who weren&#8217;t data experts.</p><p>So yeah, I go back to the true definition of that term. Fact is not a fact table in a dimensional model. And, maybe yeah, should we move to claim or should we just bring back the true definition of that term? That&#8217;s one of the problems we have in our domain, is we have what we call pedantic semantic arguments about the most non-interesting things.</p><p><strong>Marco</strong>: We are not gonna solve that because there&#8217;s so much. When I do actual information modeling and we can come up with the word, I&#8217;ve used it in, in, in different environments, but we can all agree upon what the definition is for the word inventory. It&#8217;s the amount of things that we have and offer of a certain article. But then you go into different departments.</p><p>Sales comes up with three because they already sold a few. Purchasing says, we have eight because they already ordered a few. And then you go into the warehouse and the guy goes but I only have one on the shelf. What the hell is going on? Now, even though everybody agrees, they have different data. And as soon as you, you and I would speak about facts, then I could say that the customer buys a product and we agree and we call that a fact, but it has nothing to do with the data at all. So the word fact itself is like inventory, is like location, is like customer: you can apply it to anything and it doesn&#8217;t mean anything. So getting too hung up on it is very tricky, because then you will start tribal wars: no, this is what the definition for a fact is. But the reality is that the word fact, the linguistical part of it, the term of it, is used in different contexts. So if you wanna use it within your dimensional world, it&#8217;s fine. I&#8217;m not gonna argue with that. It&#8217;s similar to calling something red, we will find it in different environments. I&#8217;m looking at a book that has a red cover. You look at the fire truck, which is also red. It&#8217;s, oh, we&#8217;re all good.</p><p><strong>Shane</strong>: Unless you&#8217;re in a country where the fire trucks are green. And yeah, it&#8217;s interesting &#8216;cause Remco Broekmans talks about the example he has of the definition of a bed in a hospital. And where one group basically said, it is the metal frame that the patient sleeps on. And another part of the organization said, nah, hold on.</p><p>It&#8217;s the room where the patient&#8217;s located. And so if I looked at the data, I would see one was probably two meters by two and a half meters and the other one was probably five meters by five meters. And then I could say the fact that this bed is five meters by five meters and has no wheels confuses me, because I&#8217;d expect it to be two by three with wheels.</p><p>Maybe that data&#8217;s gonna gimme a hint that the definitions are different. So getting into that, can you just gimme a helicopter view of how fact-based modeling works?</p><p><strong>Marco</strong>: You already gave me some beautiful examples, in that to distinguish the things, you also have to look at the data. And what I see happening in the data modeling space is usually that the data is not that relevant. 
We&#8217;re all looking at tables, entities, classes and what kind of attributes they have, what columns need to go in and what are the relationships or foreign keys and all.</p><p>So there&#8217;s a lot of technical views on it, and some of it may be guided by the data at hand. But the data itself is a secondary citizen. And as I just stated with the example of inventory, it&#8217;s only by looking at the data that you start realizing, wait a minute, we&#8217;re all calling this inventory, but we&#8217;re seeing different things.</p><p>So there must be a distinction. And instead of having this, I call it these tribal wars, these linguistical fights about no, that&#8217;s not what I said, this is what it means, and all of that. I call that type level arguments, type level discussions. It&#8217;s abstract in a way. So what fact-oriented modeling does is two things. First, how do I talk about my data? And I use the data in the expressions, so I&#8217;m grounding it. I&#8217;m not talking about customer buys product. I&#8217;m saying customer with customer number 1, 2, 3 buys the product X, Y, Z, where both 1, 2, 3 and X, Y, Z are actual data, are real examples to ground the discussion.</p><p>Because if I use inventory and I don&#8217;t specify that I look at it in a certain way by giving the data, we will not discover that we might have a difference of perception there. So the why of fact-oriented modeling or fact-based modeling is that really we need to figure out how I talk about it, how you talk about it and how we can talk about it, and ground it with actual facts, this actual data. It&#8217;s not enough to say the customer buys the product, because you and I would agree, but it&#8217;s only by giving the actual data examples that we might start realizing, wait a minute, I have numbers to identify my customers, and you might have email addresses to identify your customers, so there&#8217;s something else going on. So it is really digging that one level deeper into it, and not letting that go, because in fact-based modeling, we keep tying the example data with the language, with the fact types. So it&#8217;s always that package, so that we can at any point transform our information models into any kind of artifact, but still show the example data to accompany the definitions, to accompany the structures and to illustrate that this comes with a specific data use. And I think that&#8217;s the biggest difference from when you look at data modeling or dimensional modeling or data vault modeling: it&#8217;s all on a type level and it&#8217;s usually geared towards how do we structure the storage of data. And it&#8217;s less about how do you and I figure out how we talk about the data, and how do we quantify with data that we&#8217;re talking about the same thing? </p><p><strong>Shane</strong>: What I found interesting was, so I&#8217;ve used the who does what pattern a lot. And I think I probably got it from Lawrence Corr&#8217;s BEAM stuff, &#8216;cause that was some of the stuff I read earlier in my career. And I&#8217;ll often talk about customer orders product from employee in store.</p><p>And then I&#8217;ve also used data by example a lot. And again, I can&#8217;t remember where I found that pattern, but I found it valuable, so Bob purchased three scoops of chocolate ice cream from Jane at 106 High Street in Swindon. 
And what interests me about the fact-based modeling when I had a quick look at some of your presentations, is it seems like you are combining both those patterns.</p><p>So you are combining both the term and the fact into that data story. So you are saying, customer Bob purchased product, chocolate, ice cream from employee Jane at store 1 0 6 High Street.</p><p>Is that what you&#8217;re doing? You are combining the term and the fact together to give much more context around that business reality to then help you in the rest of the work.</p><p><strong>Marco</strong>: yeah, definitely. It&#8217;s very much that approach and of course tools give you different functions to, to do it in a different way. But the theory that was even developed in, mid seventies and continue to develop all the way up to the end of the nineties was really about how can we write it down in a way that non-technical people can actually read what it means.</p><p>And in the interviewing phase, in the workshops, in the explorative areas, is that, as I&#8217;ve given the examples earlier is that it&#8217;s not enough to just say, okay, we have a customer and we need a data system for that. It&#8217;s like we needed to also figure out how you talk about it.</p><p>And I think what is an increasingly more difficult problem is that where in the seventies, eighties, even nineties a lot of systems were built for a specific purpose, for a specific department, for a specific system that everybody knew the context of. So now if I&#8217;m doing a customer registration, then I&#8217;m just doing just that for my department within my group of employees.</p><p>And we all know the context. So everybody understands that if I put customer there, they all know what customer is. Now with the increase of it is that. It added a computer on every department, in every system, But they never integrated. So this is where data warehousing came in. But because it was very high context, implied the data warehousing, the BI team now had to figure out, okay, what if we pull the data?</p><p>What does it even mean? But we&#8217;re really trying to rediscover the context of where it is used. The human aspects of understanding need to be reverse engineered on top of the data structure. And this especially goes for what we&#8217;re currently seeing with all the LLMs. Yes, it can give great semantics or storylines, but does it really understand the context?</p><p>So this is another gap. And what the fact oriented modeling approaches is that we try to interview and capture the knowledge of the domain and subject matter experts. With all the context, with their language, with their examples and try to line that up with the data, the language and the meaning of some other department that uses a different system. And not just to be able to talk about that specific system, but to also align the communication across departments, across contexts. That&#8217;s really what got lost. If you have individual silos that you want to bring together into a data warehouse solution, that the missing piece is the original and authentic communication about their systems in the first place. And then the factor oriented approach allows those subject matter experts to talk about it in a way that they understand and can be verified by colleagues and can be transferred across. Context. And I think that&#8217;s the real added value that when systems were built within a specific context, the value of that was not really seen because it served the context.</p><p>Everybody knew that. 
Whereas you see now an increasing interest in, okay, we need to get back to our ontology, we need to get back to our semantics. We need to get back to meaning, we have to get back to information. So all of that was there and it&#8217;s still there, if you don&#8217;t throw away the baby with the bath water.</p><p><strong>Shane</strong>: I agree. I, we go through waves, so we see a wave right now outside the AI wave. We see a wave of BI metric layers. This idea of having a layer where you can define a metric, and then regardless of which system the data&#8217;s coming from or you are hitting, that same metric&#8217;s been applied.</p><p>And I go well, you know, years ago in the old BI tools we had an end user layer or a universe. We&#8217;ve had that pattern and then we lost it, and then we get it back. I&#8217;m also with you in terms of the way systems have evolved. We&#8217;ve gone from a mainframe where we had one system where there was a term called customer and there was one fact and that was it.</p><p>And then we went to client-server N-tier, and we ended up with seven. And then software as a service, 50 to a hundred. And with the AI wave now and Gen AI, we&#8217;re gonna see these one shot apps. We&#8217;re gonna see thousands of things that are created for one use case get stood up really quickly, and then probably disposed of.</p><p>And so this ability to have a language where, regardless of the system and how many we have, we get shared context is really important.</p><p>And that example that you used, I just wanna come back to that because I hadn&#8217;t thought about it that way. So if I said customer orders product, I have a thousand systems that involve a customer, an order, and a product.</p><p>I have absolutely no context whether they&#8217;re defining the same things the same way.</p><p>But if I use this data by example, if I say customer Bob ordered product, and then in another system I see customer bob@gmail.com ordered product, and in another system I see customer 1, 2, 3, ordered product, and in another system, customer A, B, C, ordered product.</p><p>What I know now, based on pattern recognition, is that I have a problem with the unique identifier of customer.</p><p>Like I can tell that in 10 seconds by seeing those four data examples, and it&#8217;s something I know I need to solve.</p><p>Because to mash that all up, I need some form of conformity, right?</p><p>Some form of either shared identifier or a way of mapping it. And so I can see that by combining this term and this fact together, it gives me the richness to understand the problems I need to solve much quicker than any of the other techniques where they&#8217;re separated.</p><p><strong>Marco</strong>: That&#8217;s true and it&#8217;s, you&#8217;re still somewhat catching an optimal path, because you already assumed in this example that all those systems had something called customer. And in one system it&#8217;s probably CST. Then you have to combine it with SAP, where it has X 3 0 4 as a table name, and then you have a very abstract table called persons. So you can see where this is going, because all of those systems, the storage and the structuring of the data in that system served only one purpose: optimizing the IT part of it. The IT end product presented the data in a way that the business wants to work with it and wants to see it, but it doesn&#8217;t represent how it is stored. The storage and the management of the physical data is technically optimized. 
It is not with the business representation in mind necessarily. And this is what a lot of people in the data space have encountered too. It&#8217;s okay, now I have 200 source systems and I need to figure out what is what.</p><p>So is this email address, does that indeed correspond with my customer? In the other system where it&#8217;s identified with 1, 2, 3, is that even the same customer? Is it a customer in the first place? So there is so much not just technical debt for systems that are undocumented or not well documented or behind in documentation, but also a, what I increasingly call the business debt is that nobody knows what that meant to the business anymore. And this is the paradox where business wants to have changed faster. And it ruined the party by saying, we can deliver faster with this new latest tech, but neither party realized what they were losing along the way. So it&#8217;s technical debt, it&#8217;s business debt. In governments it&#8217;s even worse because there&#8217;s a massive gap between the legal articles made up by politicians full of compromises and loopholes to the actual systems used by government bodies.</p><p>And, now we changed the law. Which system did we need to change or vice versa? We&#8217;re looking at data here but we have no idea if we&#8217;re even allowed to have this data or even be able to look at it because we don&#8217;t know the legal articles with it. So everything in it scaled up in the past 30 years so quickly. It totally got outta control in a way where the next tool&#8217;s not gonna solve it. An LLM is wonderful, magical stuff, but it&#8217;s not gonna solve the real thinking issues. We can do data profiling because we were still not sure if we got the context right. We can do LLM, but we are still not sure if the relationships are correct. So there&#8217;s still that gap of knowledge that we lost and somehow need to reintroduce to make things really work.</p><p><strong>Shane</strong>: And actually that&#8217;s interesting around that organizational context and losing that knowledge. Because if I think about it, if I was walking into a new organization and I wanted to understand the context of that organization, my natural technique was to find somebody who&#8217;d been there for a long time as a subject matter expert, that person that&#8217;d been there for 20 years because they&#8217;ve got the stories of how that context happened and why it happened.</p><p>And so moving to the uk I had to set up bank accounts. And so I went, and there&#8217;s a whole reason that I needed to do a UK domicile bank rather than one of the newer, easier to deal with banks. And so I went to one of the main banks, I created a personal bank account.</p><p>Took a while. All good. And then I went back to that. &#8216;cause now I&#8217;m a customer of being identified. In theory. Everything is easy. And I tried to create a business bank account and it forced me to create a new identity. I had to go through the whole identity process, even though I used the same email address, which was my form of identification, my identifier, I had to go and revalidate myself that I have a same address, same passport number.</p><p>And you sit back as a customer and you go, that&#8217;s just crazy bollocks. And then I talked to somebody that had worked for that bank, that subject matter expert who&#8217;d been there for a while. 
And they said, yeah, but you gotta understand that was two different banks, two different systems, and they haven&#8217;t been merged.</p><p>And therefore you may think you&#8217;ll be dealing with the same organization, but you&#8217;re not really. And I&#8217;m like, yeah, actually. Okay, that makes sense. Now why? Why did I not understand that working in data so often, but if we go back to that example of term, in fact, so yes, if the term changes, so now I see a customer and I see person and I see prospect and I see X 2 0 3.</p><p>If the fact is the same, if every one of those is customer Bob at Gmail X 2 0 3, Bob at Gmail, prospect Bob at Gmail, again, I&#8217;m getting more context, I&#8217;m not getting an answer, but I&#8217;m getting more information that can help me understand a problem to be solved. And as we know with data, there&#8217;s so many problems to be solved.</p><p>It&#8217;s such a complex space. When we hit reality of organizations, the way they work, the terms they use, the systems they use, the way they create and store data. But this idea of binding term and facts. Gives me some more hints because now I can say they&#8217;re all the same email address, are they the same term?</p><p>And then somebody will say actually no, when you see prospect and customer, it is a different rule and then somebody else will go. But you do know that they can change their email address whenever they feel like it. And you&#8217;re like, yep, seen that Pattern before. Okay, so it&#8217;s an identifier, it&#8217;s a unique identifier, but it&#8217;s not a consistent or persistent identifier.</p><p>There&#8217;s all these patterns and data that we know are gonna hit us, but by binding that term, in that fact, I get some hints at the beginning across multiple subject matter experts. So I can see real value in, in that part of the process early. And so talk me through that, You drop into a new organization.</p><p>You wanna start off with fact-based modeling. How do you do it? What do you actually do?</p><p><strong>Marco</strong>: It&#8217;s a funny question. I, a lot of people ask me that, how do you start it? If you open up the books about this topic that either are written as a university. Proof of concept all the way to, self study. It always starts with gather your sources, figure out what is the domain about in the first place.</p><p>So there&#8217;s very much a almost top down approach where you go okay, we got sales, we&#8217;ve got production. so it&#8217;s the general area. You can&#8217;t just jump into the jungle and start describing all the little insects on the jungle floor.</p><p>It&#8217;s that&#8217;s not how it works. But usually there is a problem domain, there is an integration problem. and. So what needs to happen is that you need to be able to carve out some time with at least some business domain expert subject matter expert to sit down and say, okay what&#8217;s the issue here?</p><p>What do you, what are you doing? What does it do? And start writing it down. so far, not that much different from actual data modeling, if you will. But the distinction starts with the nitty gritty where you need to get things right. And it&#8217;s usually getting things right, not just to verify if you understood correctly, because that&#8217;s already a hard part as my two hour example earlier showed to, just get one line of requirements, correct. But that back and forth with language and examples is of great help in getting to actual understanding. 
But the major part is usually that you need to work with a colleague that also needs to understand it. And with the current short-lived career jumps, if you will, you may find an expert, but he might be gone in four years. Or, as is happening currently as well, a lot of the seniors that actually know the organization are getting into pension range and they're just leaving the company. And what is left behind is usually short-lived career steps, short-lived managers that move on. There's a lot of tempo, where the average job tenure is four to six years. So the knowledge actually evaporates while we're looking at it, while we're trying to document it. And that's where the difference starts to arise between traditional data modeling, which creates diagrams and type-level schemas, and fact-based modeling, where you have real user stories that you can pull up with a click of a mouse and read how it's being used in language in the organization and how it's being transferred to other departments. And I think that securing that knowledge, that semantically rich document if you will, is something that I see getting lost with the more traditional, more technical data modeling. Yes, you have the traditional layers of conceptual, logical, physical, but still, there's not really a story there. And the people that I talk with and explain this kind of stuff to all recognize what I'm describing, and a lot of the architects and a lot of the data modelers will reply with, yeah, that's what I do in my head. And then my obvious question is: but does it leave your head, does it get written down somewhere, so that if you leave, your colleague can take over? And the answer is usually no. There might be a document somewhere that describes the use case, but that very quickly evaporates as well, because everybody starts looking at the artifacts and the technical schemas. So having an environment where you tie it all together, where you cannot do one without the other, that was really the grounding for the discussions, the proof of the pudding if you will. But it also gives you the anchors going forward to technical artifacts. So by capturing that from the domain experts, putting that in an information model and being able to generate technical artifacts, even SQL to generate a database, it will still comment all that SQL with all the written language and examples. It will generate a database with test data that came from the interview in the first place. It will add database views representing the full user stories on top of the production data, representing the interview. So it really is: don't throw it away, don't throw it away. Keep it as long as possible, so that everybody is able to understand and read what the actual data is, what it means and how it's communicated.</p><p><strong>Shane</strong>: That's interesting, because that is a form of context-first implementation. And what I mean by that is, with our product, we define the context first and then the technical implementation is hydrated. So I'll create the context of a business rule, a change rule, and then our system will hydrate that into physical tables and SQL transformations.</p><p>But that context that I create is the key thing we care about, right? That is our pet. The way we deploy it and run it is our cattle.
And what we're seeing in the new GenAI, LLM world is that context is actually far more important for the LLMs than the physical implementation of that context.</p><p>And then, as I said, I'm working with Juha Korpela at the moment, trying to write a book around how to concept model. And one of the things we came up with is that one of the steps is: define your domain scope. So, like you said, find a subject matter expert who understands it.</p><p>The next one is get data stories. And then after that, identify events, then concepts, and then connections or relationships. And the key thing is, as you said, when you talk to experts in this process, if they can articulate the steps they take, those steps are often common. They might use different terms, and they might do them in slightly different order, in slightly different ways, but we all do it the same.</p><p>If we think about it consciously, it's where we bring in pattern templates, where we bring in artifacts that we use repeatably, that we get that repeatability, that ability for that context to be stored in a way that another person can use it.</p><p>So in my view, yes, it's great if we have a system that does all that for us, but we don't have to. If we just have templates that are reusable, that's valuable on its own.</p><p>If we have a repeatable process, that's not a methodology, right? It's not fixed, it's just a way of working, and that is valuable because it's bringing that knowledge back. So I just wanna take you back to something that you said. If I take this idea of term and fact having massive value, and that we get that with a subject matter expert, and by documenting it in that format,</p><p>that context is able to be seen and understood by many people other than us. You then said that it's also a way of getting alignment. So if I get that term and fact, if I get that fact-based model for inventory from three different domains, how do you deal with the alignment problem?</p><p><strong>Marco</strong>: I can illustrate it with an example. And let's stay close to the examples that we already mentioned, but it's more powerful, more generic and more diverse than that. I think we would need a podcast of another two days to explain all of the ins and outs, so really this is just to show you how it would work. And for the listeners, doing this in an audio-only podcast, I'll try to make it as visual as I can. So when I say Marco Wobben lives in Urich, which is the city where I live, that would be a fact that you and I can agree on, and me and my wife, we definitely agree upon that.</p><p>So let's consider that a fact, a statement that holds some truth value. But there's something else going on, because Marco Wobben lives in Urich is a statement, my wife lives in Urich, my kids live in Urich, so there's a bunch of statements.</p><p>They all express something that we can classify as the city of residence. So by stating multiple examples like that, we can type those kinds of statements with city of residence as the fact type. Now, within me saying Marco Wobben lives in Urich, I actually embed knowledge in there, even though you and I can agree upon the actual fact. I'm also saying, wait, Marco Wobben is the citizen, Urich is the city. But it goes a little deeper, because my citizen is not really a citizen. It is just how I identify a citizen.
And in that identification I can see, wait a minute, there's a first name and there's a surname. Similar with the city of Urich: it's not really a city, we're representing the city by storing the name of the city. So you can see, if you can visualize it, that there's knowledge, almost like a graph in there, where I started with the city of residence. I gave it the semantics lives in, but it also has structure. It has the citizen, it has the city, it has the first name, the surname, the city name, and all of that ties together into this single fact statement. Now I can populate that with different examples. I can give my wife a position in there, and my kids. And so it is populated with all kinds of example data, and then the subject matter expert is posed a series of questions. Could it happen at the same time that Marco lives in Urich and Marco lives in New York?</p><p>And then he would probably say, no, there's something wrong with that. By going through these interactive sessions, which are almost like a gamification of the interview if you will, you discover the business constraints, and by discovering the business constraints, that will lead to a certain structure.</p><p>When it comes to data modeling: if I can live in only one city, then the city is probably an attribute in the citizen's table. If I can have multiple cities where I can live, because that's allowed in our register, then I would probably need a linking table in the physical model in the end. So these constraints steer how the data is structured. And this is the interesting byproduct, and I'm not sure if I'm still on track with your question: in the interviews with subject matter experts, we can find ourselves very easily in hours-long sessions about what is a citizen. But as soon as they cannot give me a proper example to illustrate what they're talking about, I'm talking to people that are working outside of their scope of expertise. So that helps steer it. So there's an organizational alignment in my efforts to find the data illustrating the information. Now, the alignment on the other hand: now I start identifying Marco at some Gmail address lives in Urich. So now I'm identifying the same citizen, but I'm using an email address. Now, obviously, in official organizations and registers that would never do, but let's suppose we have a small tennis club and it's fine. So what I find now is that I still speak of city of residence, I still speak of citizen and city name, but I don't have first name and surname.</p><p>So suddenly that citizen, where it used to be first name and surname, now is an email address, but it still identifies a citizen. So what happens in the information grammar, if you will, is that there is something introduced called a generalized object type. My citizen can either be identified by first name and surname, or by email address. So in the modeling part, it's very easy to find statements that either generalize, which means it allows different ways of identification, or give different ways to talk about it. Now I am presented with a data challenge, because which combination of first name and surname goes with which email address? So again, I need a different fact statement that would now introduce: Marco Wobben has an email address, marco-such-and-such at Gmail, which links the two keys.</p>
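<p>A minimal, invented sketch of the ideas above, written in Python rather than in CaseTalk or any formal fact-based modeling notation (the names, structures and email address are illustrative only): example fact statements are typed into a fact type, the uniqueness constraint discovered in the interview decides between an attribute and a linking table, and a bridging fact links the two ways of identifying the generalized citizen.</p><pre><code># Hypothetical sketch only: fact statements, a constraint check, and the
# bridging fact that aligns two identification schemes for the same citizen.
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    fact_type: str   # e.g. "city of residence"
    citizen: tuple   # reference scheme: (first name, surname) or (email,)
    city_name: str

population = [
    Fact("city of residence", ("Marco", "Wobben"), "Urich"),
    Fact("city of residence", ("Mrs.", "Wobben"), "Urich"),
    Fact("city of residence", ("marco.such.and.such@gmail.example",), "Urich"),
]

def lives_in_one_city(facts):
    """Constraint from the interview: a citizen lives in at most one city."""
    cities_per_citizen = {}
    for f in facts:
        cities_per_citizen.setdefault(f.citizen, set()).add(f.city_name)
    return all(len(cities) == 1 for cities in cities_per_citizen.values())

# If the constraint holds, city can become an attribute of the citizen table;
# if it does not, the generator would emit a separate linking table instead.
print("city as attribute" if lives_in_one_city(population) else "linking table needed")

# The extra fact statement that links the two keys of the generalized citizen.
identity_bridge = {("Marco", "Wobben"): "marco.such.and.such@gmail.example"}
</code></pre>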
<p>So nothing in my communication has to be altered to support alignment of different ways of identifying it, doing data mapping, just as part of the verbalization. I have a department here that only works with names. I have a department there that only works with Gmail addresses. I can make them talk to each other, 'cause that needs to happen sooner or later. And the way that they talk to each other is to say, yeah, you're right, this Marco Wobben corresponds with that email address, because I have proof of that.</p><p>So then suddenly you have a mechanism introduced in the communication on how to identify certain entities in your data administration in a unified way. So these are different examples of how these alignments work, on a semantic level, on a data level, on an identification level. In short.</p><p><strong>Shane</strong>: I was thinking about it slightly differently, but it's exactly the same process, I think. So let me play it back to you, where I think I got to. You are talking there about, okay, we have this term, and the term has the same context definition, but the facts that identify that term are slightly different, right?</p><p>And we can then do some alignment where we see these different identifiers.</p><p><strong>Marco</strong>: Yep.</p><p><strong>Shane</strong>: The one that I was thinking about is where a domain or a business unit has a completely different definition for that term than another one does, and we identify that. So let me give you an example that's real for me right at the moment, around citizenship or residency.</p><p><strong>Marco</strong>: Oh.</p><p><strong>Shane</strong>: So I think I know what made me a resident in the UK. It's when I got a certain kind of visa and I entered the country. And as soon as I did that, that's the rule that triggered the tick in the box that I am a resident of the UK. But when I deal with tax residency, it's a different set of rules.</p><p>And so they're slightly different. I have to live here, but then also I have to make sure that there are certain things I don't do anymore back in my original country of tax residency. So they're both residencies, but they're slightly different. And then how would I articulate that using term and fact? How would I fact-based model it? And the thing that you talked about is this idea of business constraints. If I can describe the business constraint of what a residency is versus the business constraint of a tax residency, now I've got two patterns, and I can look at those two patterns, those two constraints, and say they're the same, or similar, or they're very different.</p><p>And in this case I'd say they're different. One is around how my financial transactions work, and one is where my ass sits when I have breakfast for the majority of the year, 'cause I can spend a certain amount of time outside the country and I am still a resident of this country.</p><p>And then I think the other thing that this pattern gives us, which we know happens a lot because we see humans do it, is that as soon as I give somebody those terms and facts, and a set of business constraints, a set of descriptions of how they behave, humans love to point out exceptions, especially subject matter experts.</p><p>It's, yeah, you could be a tax resident in the UK if you do that, but actually if you do this one other thing that nobody ever tells you about, you are not. That's the exception.
Oh, and by the way, system A knows about that, but system B doesn't. And humans are really good at pulling that out.</p><p>I dunno what you would call it, the knowledge that only they have. And like you said, when we used to work for organizations for 25 years, that knowledge stayed in the organization; now you're lucky if it's five. So that context, that knowledge of those exceptions, disappears. I can see how this idea of defining or articulating business constraints, identifying exceptions, binding them to that term-and-fact pattern in an artifact that's repeatable,</p><p>means we can now shortcut, like you said, the need for a data expert and a subject matter expert to spend a week in a room going through this magical process that only two people can do, and make it a little bit more, not democratized, but a little bit more accessible and repeatable.</p><p><strong>Marco</strong>: It's an interesting example, but again, it touches the exceptions. Your example in itself is an exception, and there is too much to say about exceptions. But even talking about it as you just did, from a tax context, Shane the citizen: even in our semantics and in our language use, we already start distinguishing that. And then it comes back to: do we identify the citizen in a tax context the same as the citizen in a legal context? And to tie it back to my previous example, are those two the same citizen? All three of them are different domains. There is the tax domain, there's the legal domain, and there is the domain where we might want to match whether this is the same person, for whatever anti-terrorist organization or whatever. So there are different domains, and every domain has a different rule set. We blankly call them all citizen, because mentally we can say, yeah, it's the same person, it's the same citizen. But data-administration wise, the identifiers for a citizen in one context are not the same as the identifiers for a citizen in a different context. And I think, again, that's the big separator, and it causes a lot of confusion in interviews and workshops: we humans unify that, because we see that in reality it is about one person. But what distinguishes it is that we need to talk about how we talk about the data. And that is not the same thing as the reality. And by separating those out, in a way it makes it easier to say, okay, but are we talking about the same data administration?</p><p>No. Then we separate those out, and if we are talking about three different domains that need to come together in one data administration, then we need more semantics distinguishing the three.</p><p><strong>Shane</strong>: A couple of things I want to loop around there. So again, if we go back to the term, actually there is a citizenship and a physical residency and a tax residency, because I'm actually a citizen of New Zealand still, but I'm resident physically in the UK, and my tax residency is coming with me. But again, it's not until we start using terms and facts together that I can articulate that those are three different things, and it's the business constraints and the exceptions that tell me they are different things.</p>
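<p>A toy illustration of that point, with the structure invented for this post and the constraints paraphrased from Shane's description rather than taken from any actual immigration or tax rules: writing the two residency terms down with their constraints is what shows they are different concepts, even though they share a word.</p><pre><code># Toy example: the same word, but different definitions and constraints per domain.
residency = {
    "immigration domain": {
        "term": "UK resident",
        "constraints": [
            "holds a qualifying visa",
            "has entered the country",
        ],
    },
    "tax domain": {
        "term": "UK tax resident",
        "constraints": [
            "lives in the country for the required part of the year",
            "has stopped certain activities in the previous country of tax residency",
        ],
    },
}

# The terms sound the same; comparing the written-down constraints shows
# whether they really are the same concept or belong to separate domains.
a = set(residency["immigration domain"]["constraints"])
b = set(residency["tax domain"]["constraints"])
print("same concept" if a == b else "different concepts: treat as separate domains")
</code></pre>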
<p>One of the things that I find interesting is this idea of focusing on the complexity, and modeling that, because that is the area that's gonna cause you a problem.</p><p>So if you ever do Data Vault training with Hans and Remco Broekmans, and I dunno if they do it anymore, they used to talk about Peter the Fly. And the example would be: I am sitting in this ice cream store and I see Bob come in and order an ice cream, and there's an order ID. So when I put term and fact together, I can say there's an ID in there of 1234, which is the order number. But whether they actually got the ice cream is a separate part of that process. As Peter the Fly, I can see the ice cream being handed to Bob. So I know it happened if I'm in the room, but if I'm looking purely at the data, at the facts, I see nothing, because there is no handover,</p><p>there's no delivery ID that says they got the ice cream. So I'm gonna have to infer that if there was an order and no refund turned up, it happened, but it's not a fact. And so I think, again, those terms and facts tell our stories, and then as data modelers we can look at the exceptions that we know we need to worry about.</p><p>But I wanna go back to this idea of domain, because it's one that I've struggled with a lot and I still do. A domain boundary can be anything. It can be a business unit, it can be a series of core business events, it can be a process, it can be a use case, it can be a team topology. A domain is just a boundary where you say it's in that boundary or it's not, right?</p><p>It's in this domain, it's in that domain. And I hadn't thought about using the term, fact and business constraints as a way of defining boundaries for domains. So if I see a term and facts that are all around my citizenship, and I see another term and series of facts and business constraints around my residency, and I see another one around tax residency, and they are different, then I can use those as domain boundaries.</p><p>They might be too granular for the intent that I want to use them for. But what we're saying is they are different. If you write them down as words and numbers, you can tell they're different. So that gives us a form of boundary. And that's really interesting, because it gives us a pattern,</p><p>it gives us a formula for how we can say that this sits in domain A and this sits in domain B, because of these rules. And I find that really intriguing and valuable, 'cause it's solving a problem that I've been trying to solve for many years, 'cause it's annoying me.</p><p><strong>Marco</strong>: Yeah, it is. You're right, there are two big areas where you can see there's a separation of domain. And rules is one of them. In hospital systems, for example, you might need somebody's birthday to be able to put them into the computer.</p><p>He had these kinds of operations, these treatments, it needs to go to his insurance company and all of that. But if you're brought into the emergency room unconscious, with no ID on you, you're still being registered as a patient. So the rule is entirely different, yet we have a patient number somehow. And the definition of the patient is clear across the whole hospital. So rules determine some sort of context.
Or, if you come in with an appointment, they probably know your birthdate. If you wanna send it to the insurance company, they have to have your birthdate. The emergency room doesn't really care. So how do you align all of that? So then you see, again, we're unified in the language, we're differentiating on the rules, and somehow we have to integrate systems. The most important part is that, independent of which system is being built, we need to figure out how we communicate first.</p><p>So that language always goes number one. And that's basically what fact-based modeling does. It allows people to talk about it from their context, and then, given that specific context, you add specific rules for that context. Take the example of whether I can live in multiple cities: municipality-wise, no, you cannot. You have to be registered in one municipality. Now, in a more generic way, you can have people registered to different places, sure. I have an office here, I live there, I have a vacation home there. Different places. But on a municipality level, where it becomes more context specific, there is a rule that says you can only register in one municipality. So there's that too; there is a way to generalize in a more abstract way. IT has been doing that for years, where they introduced the party model: is it a person, is it an organization? We don't know, so let's call it party. We'll give it an artificial key and we're good. That's just a way to keep the IT system running. It has nothing to do with semantics or integration or alignment or whatsoever. It's just a technical solution, because we didn't get the business meaning in the first place, or we are not specific enough to a specific context. So these problems will always occur. And I think that's what separates data modeling from information modeling: data modeling is, do we find alignment in how we need to build the system? Whereas information modeling is, do we find alignment on how we communicate about the data?</p><p><strong>Shane</strong>: So again, you are making a context differentiation between the audience, the consumer of what we are creating. And you are saying if it's stakeholders who aren't data experts or IT experts, then we're information modeling. If it's data people or IT people, then we are data modeling.</p><p>And the ability to have both languages, but then a mapping and sharing across those languages, is where the real value is. And so, like you, I'm not a fan of thing as a thing. I don't agree we should ever use that term in our information models. We should never use party, entity, thing as a thing,</p><p>a thing is associated with a thing, as a generic way of describing context to a stakeholder. I also don't think we should put that in our technical systems. Years ago, when we had mainframes and we had no memory and we had no disk and the infrastructure was expensive, yes, it had value to us. Right now, the most expensive part of our systems is the humans, and understanding that context.</p><p>And as soon as you design a system with thing as a thing, and that context lives nowhere else, I now have to spend a massive amount of expensive time trying to understand what the hell: how many things you have, how many things they are related to, and what those relationships are.</p><p>And that is an expensive piece of work.
And I just can't justify doing that anymore. But as you can tell, I'm slightly opinionated on that one. And also, if I come back to the way we do that information modeling, the grain of it, the detail we go to, is interesting,</p><p>because if I do it just based on terms, customer orders product, it's very different from doing it based on terms and facts, customer Bob orders ice cream.</p><p><strong>Marco</strong>: Yes.</p><p><strong>Shane</strong>: And so what you are saying is that you are actually bringing more detail, a higher level of grain, into that information modeling process earlier, because it has value. So you're doing more work upfront, because you've found ways of taking that work that's done early and automating some of the downstream work in terms of the technical implementations.</p><p>And so, from an agile point of view, you are doing work upfront, work in advance, which could be waste, but you've found a way of taking that work and reducing the waste further down the value stream.</p><p><strong>Marco</strong>: Yeah, correct. It's really that. And for those who are interested in this, a lot of the information can be found on the website casetalk.com. What it shows, and this was developed, like I said, from the seventies all the way up to 2000, is that it's not just, oh, how can we capture the language?</p><p>It was really: how do we talk to the domain expert to come up with the appropriate IT system? And I think what really happened is that it was such a rich and detailed environment, where people traditionally modeled with business knowledge in their heads and did the IT themselves, which was very closely bound to the organization. When the first computers came in, it was usually the domain expert that got an IT training. They knew the context. But nowadays it has scaled up so much that you are an IT professional, you have no idea about any business, your business is IT. So getting back to that and trying to find that alignment with business is of increasing interest, but it's not the majority.</p><p>A lot of new young professionals base their efforts on the tools and the ability of tools, and not so much on are we doing the right thing for the business, because they don't really care, they were not educated as such. There's a massive gap there, where business looks at IT like, you are the expert, so you tell me, and then there's this rift. You mentioned Data Vault a couple of times, where traditionally the hub is the natural business key, right? It's not a technical key in the source system. So what is the business key? Then you have to talk to the business. And the rich information model, with all the facts, already mentioned Marco lives in Urich.</p><p>So I already have the natural business keys right there. And with a push of a button, it can then be generated into a citizen table and a city table, and even have artificial keys, with minor annotations like, for city of residence, we might want to keep a log in time, we might wanna have history in there, but also when is he planning to move to the next city, so it becomes bitemporal. And these are very simple flags in the information model that can be generated into a data model that says: did you want a Data Vault model or a normalized model, or did you want a JSON schema? So the physical parts become automatable.</p>
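<p>A rough sketch of what that generation step could look like, again invented for illustration rather than CaseTalk's actual output or API: a tiny fact model that carries the verbalization, constraint and example data, and a generator that emits SQL with that original language kept as comments, so the business story travels with the physical artifact.</p><pre><code># Hypothetical, simplified generator: the information model keeps the
# verbalization and examples, and the emitted SQL carries them as comments.

fact_model = {
    "fact_type": "city of residence",
    "verbalisation": "Citizen (first name, surname) lives in City (city name)",
    "constraint": "a citizen lives in at most one city",
    "examples": [("Marco", "Wobben", "Urich")],
}

def generate_citizen_table(model):
    lines = [
        f"-- Fact type: {model['fact_type']}",
        f"-- Verbalisation: {model['verbalisation']}",
        f"-- Constraint: {model['constraint']}",
        "CREATE TABLE citizen (",
        "  citizen_key INTEGER PRIMARY KEY,  -- artificial key",
        "  first_name  TEXT,",
        "  surname     TEXT,",
        "  city_name   TEXT  -- attribute, because of the one-city constraint",
        ");",
    ]
    # Test data generated from the interview examples.
    for i, (first, last, city) in enumerate(model["examples"], start=1):
        lines.append(
            f"INSERT INTO citizen (citizen_key, first_name, surname, city_name) "
            f"VALUES ({i}, '{first}', '{last}', '{city}');"
        )
    return "\n".join(lines)

print(generate_citizen_table(fact_model))
</code></pre>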
<p>The data model becomes automatable, and having all the semantics and verbiage and examples makes it verifiable and readable by the business. And that's really what it ties into, and some experiments show that precisely that combination is the real power to keep LLMs grounded as well.</p><p><strong>Shane</strong>: I definitely agree on the LLM front. We know that if we pass the terms, definitions of the terms, the facts (so data examples around those terms), their relationships, and natural-language business constraints into those LLMs, we get a much better response when we want to do a task.</p><p><strong>Marco</strong>: Yep.</p><p><strong>Shane</strong>: So that context is really important.</p><p>And that's why it's interesting watching all the BI semantic layer vendors try to say they're context engines, and you're sitting there going, but you're just at the end of the chain, you've just got metrics and maybe some definitions, but the richness is all sitting on the left of our value stream.</p><p>It's all around problem ideation, discovery, design. It's not the physical implementation of our consume layer for your cube. Yes, that's got some value, but I'm really interested to watch the reinvention of the market yet again, 'cause as you said, technology is often what people get taught.</p><p>A number of times I'll talk to a data science student who's been taught Python, and you go back and you go, but how do you understand the business problem? How do you understand the value you're gonna deliver if you build that ML model? And as you said, we've seen over time that data teams that don't add value don't survive.</p><p><strong>Marco</strong>: True. And I remember an interview I was told about, and not to talk them down, but it really points at the lack of education as well, or the business debt and the technical debt. People are just being trained in tools and in technical approaches. Some of the analysts, asked what is the highlight of their day, really said: it's when they discover an insight.</p><p>It's like, insight? Should it not have started with the insight as documentation, or what are we really doing, instead of let's discover it? That is, like you said, the end of the chain, while everybody shouts shift left, shift left. It's yeah, why?</p><p>Because that's where it started and that's what we lost. From my perspective, it just starts with: can we talk to each other, are we writing that down, and how do we do that? Because that's what we need to do in the end. And whatever IT system we build, whatever dashboard is being built, it actually serves to better communicate about what we are really doing. So it all ties back to that. And yeah, it's a never-ending story, because the LLM is the next silver bullet, it's gonna solve all our problems. And it's not. It's helpful, but it's not the silver bullet, 'cause we still need to get the authentic story, not the fabricated story.</p><p><strong>Shane</strong>: Yeah, it'll help us understand where the context differs, but it won't help us actually define what the context is.</p><p><strong>Marco</strong>: Yep.</p><p><strong>Shane</strong>: At the moment, from the stuff I've done with it.
Oh, maybe the next generation AGI version will, or maybe we all end up with standardized context that every business follows.</p><p>But we know that's not true either, right? We saw that with tools like SAP, where it's implemented vanilla, and then $50 million and five years later there's a bastardized, customized version, because that organization is different, with air quotes.</p><p><strong>Marco</strong>: And I can illustrate that with a fairly simple example, I mentioned it somewhere else. There's not a bank in the world that didn't purchase the IBM banking model; at the same time, I dare any bank in the world to say they actually implemented it. And that points to a massive gap: to be able to implement it, you first have to know what you already do. Which is a massive dilemma, 'cause nobody has the information models, and then you are presented with a technical model, like the banking model, that says everything will fit in here.</p><p>It's okay, but what do we have that actually fits? So there's a massive dilemma there. And then in the end, every bank will try to do it a little bit different, so they have an advantage over the competition. So they don't really want to conform to one standard, which points back at: there is no real universal pattern, because everybody tries to give it their own little edge, which means you have to capture that edge, not just build something that you think will fit.</p><p>You have to be very specific. And in that, I've seen systems where the database was designed with rigor and the software was developed three times over, but the database didn't change. What happened is that a new wave of tech came: it used to be mainframe, then it became Windows, then it became internet, then it became mobile. The data did not change; the way the technology worked changed, the organization changed, the business processes changed, because we're not having a physical storefront for the bank, now we have a mobile app. So the process of working with the data changed, but the data itself didn't change. So there's a lot of dynamics going on, and it really shows the importance of getting the data right, and of not throwing away the story that came with it, so that you can actually reinvent all the processes and all the software on top of it without losing meaning. But it also points at the fast-changing world, where we perceive that everything changes all the time, so we cannot sit down to do a proper information model and have a data model set up, et cetera, et cetera, because we're in a hurry, we need to deliver. What? Nobody realizes that if you sit down long enough to get that information correct, you have the data model correct.</p><p>And if you don't throw away the story, it will definitely give you a return on investment.</p><p><strong>Shane</strong>: I've gotta say that the IBM data model was a masterclass in sales. The fact that you could sell a diagram, a picture, for millions of dollars, that nobody ever used apart from putting it on their wall, that is a masterclass. But just to close it out, I go back to that idea of terms and facts.</p><p>So if I have a retail bank that had a store, the terms and facts might have been customer Shane has an account
123.</p><p>If I became a mobile-only customer when we changed technology, I might see the term and fact change slightly, where it's customer 075-yada-yada has an account 123.</p><p>Now I can look at those two lines and I can see something's changed.</p><p>The fact that relates to that term has changed, and now I can have a conversation about: is that just what you are showing me, and in the background it still says Shane, or did it never say Shane in the real data, it was always customer ID 123924? There's a whole lot of conversations I could have, but I know that something's different, and now I know what to have a conversation about with that subject matter expert.</p><p>As you said, in the past the subject matter expert, the data person and the IT person were the same person.</p><p>Then it became the same team. Then we became a business and IT team, and now we are a business subject matter expert team, a data team and an IT team. Our team topologies have got more matrixed, bigger, and that separation causes some problems.</p><p>So by using patterns and pattern templates and artifacts and shared language, we can close that gap again. And for me, this idea of fact-based modeling, this idea of binding terms and facts with a set of rules, with a set of exceptions, as context, is really valuable. We can use it in so many ways.</p><p>So just to close it out, if people wanted to hear more about this, read more about this, find out more around fact-based modeling, where do they go?</p><p><strong>Marco</strong>: I think the quickest way into it: I personally wrote a little book, published by Technics Publications, called Just the Facts. It tells the story and gives a few of the examples that I mentioned in this podcast too. It gives you an overview from management to architecture, modelers and developers, and how communication and facts weave through all the disciplines. Obviously CaseTalk is the software tool to support all of that. But you will find links to actual books, and if you happen to have a copy of the DMBOK by DAMA, the older edition has a crippled article about it, and the version two, the latest edition, has a slightly improved article about it. It's on Wikipedia, there are sources enough, but the good starting point would be casetalk.com.</p><p><strong>Shane</strong>: Excellent. Alright, thank you for that. I've got a new set of patterns and templates I need to go and read a lot more about. That's gonna keep me busy over the next few months again. But thank you for that.</p><p><strong>Marco</strong>: Good. Thanks for having me, and talk to you soon.</p><p><strong>Shane</strong>: I hope everybody has a simply magical day.
</p><h2>&#171;oo&#187;</h2><div class="pullquote"><p><em>Stakeholder - &#8220;Thats not what I wanted!&#8221; <br>Data Team - &#8220;But thats what you asked for!&#8221;</em></p></div><p>Struggling to gather data requirements and constantly hearing the conversation above?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Bu2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" width="387" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:387,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19725,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/160520537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Want to learn how to capture data and information requirements in a repeatable way so stakeholders love them and data teams can build from them, by using the Information Product Canvas.</p><p>Have I got the book for you!</p><p>Start your journey to a new Agile Data Way of Working.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://adiwow.com/168&quot;,&quot;text&quot;:&quot;Buy the Agile Data Guide now!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://adiwow.com/168"><span>Buy the Agile Data Guide now!</span></a></p><h2>&#171;oo&#187;</h2>]]></content:encoded></item><item><title><![CDATA[Redesigning traditional data systems to support Large Language Models and AI agents with Mayowa Oludoyi]]></title><description><![CDATA[AgileData Podcast #80

Join Shane Gibson as he chats with Mayowa Oludoyi about redesigning traditional data systems to better support LLM and AI Agents]]></description><link>https://agiledata.info/p/redesigning-traditional-data-systems</link><guid isPermaLink="false">https://agiledata.info/p/redesigning-traditional-data-systems</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Sun, 18 Jan 2026 10:46:30 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d2fa4972-e9fb-4616-915e-b9b7aa2cf9c0_800x800.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Mayowa Oludoyi about redesigning traditional data systems to better support LLM and AI Agents</p><blockquote><p><strong><a href="https://agiledata.substack.com/i/184771911/listen">Listen</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/184771911/google-notebooklm-mindmap">View MindMap</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/184771911/google-notebooklm-briefing">Read AI Summary</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/184771911/transcript">Read Transcript</a></strong></p></blockquote><p></p><h2>Listen</h2><p>Listen on all good podcast hosts or over at:</p><p><a href="https://podcast.agiledata.io/e/redesigning-traditional-data-systems-to-support-large-language-models-and-ai-agents-with-mayowa-oludoyi-episode-80/">https://podcast.agiledata.io/e/redesigning-traditional-data-systems-to-support-large-language-models-and-ai-agents-with-mayowa-oludoyi-episode-80/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://podcast.agiledata.io/e/redesigning-traditional-data-systems-to-support-large-language-models-and-ai-agents-with-mayowa-oludoyi-episode-80/&quot;,&quot;text&quot;:&quot;Listen to the Podcast Episode on Podbean&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://podcast.agiledata.io/e/redesigning-traditional-data-systems-to-support-large-language-models-and-ai-agents-with-mayowa-oludoyi-episode-80/"><span>Listen to the Podcast Episode on Podbean</span></a></p><p></p><blockquote><p><strong>Subscribe:</strong> <a href="https://podcasts.apple.com/nz/podcast/agiledata/id1456820781">Apple Podcast</a> | <a href="https://open.spotify.com/show/4wiQWj055HchKMxmYSKRIj">Spotify</a> | <a href="https://www.google.com/podcasts?feed=aHR0cHM6Ly9wb2RjYXN0LmFnaWxlZGF0YS5pby9mZWVkLnhtbA%3D%3D">Google Podcast </a>| <a href="https://music.amazon.com/podcasts/add0fc3f-ee5c-4227-bd28-35144d1bd9a6">Amazon Audible</a> | <a href="https://tunein.com/podcasts/Technology-Podcasts/AgileBI-p1214546/">TuneIn</a> | <a href="https://iheart.com/podcast/96630976">iHeartRadio</a> | <a href="https://player.fm/series/3347067">PlayerFM</a> | <a href="https://www.listennotes.com/podcasts/agiledata-agiledata-8ADKjli_fGx/">Listen Notes</a> | <a href="https://www.podchaser.com/podcasts/agiledata-822089">Podchaser</a> | <a href="https://www.deezer.com/en/show/5294327">Deezer</a> | <a href="https://podcastaddict.com/podcast/agiledata/4554760">Podcast Addict</a> |</p></blockquote><div id="youtube2-R-Zc3D5GWvk" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;R-Zc3D5GWvk&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/R-Zc3D5GWvk?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" 
loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>You can get in touch with Mayowa via <a href="https://www.linkedin.com/in/oludoyi-mayowa/">LinkedIn</a></p><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><h2>Google NotebookLM Mindmap </h2><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7AEc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa08baf5e-9d9e-49e6-91aa-2315c0e1c868_4391x7139.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7AEc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa08baf5e-9d9e-49e6-91aa-2315c0e1c868_4391x7139.png 424w, https://substackcdn.com/image/fetch/$s_!7AEc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa08baf5e-9d9e-49e6-91aa-2315c0e1c868_4391x7139.png 848w, https://substackcdn.com/image/fetch/$s_!7AEc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa08baf5e-9d9e-49e6-91aa-2315c0e1c868_4391x7139.png 1272w, https://substackcdn.com/image/fetch/$s_!7AEc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa08baf5e-9d9e-49e6-91aa-2315c0e1c868_4391x7139.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7AEc!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa08baf5e-9d9e-49e6-91aa-2315c0e1c868_4391x7139.png" width="1200" height="1950.8241758241759" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a08baf5e-9d9e-49e6-91aa-2315c0e1c868_4391x7139.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:2367,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:1513523,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/184771911?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa08baf5e-9d9e-49e6-91aa-2315c0e1c868_4391x7139.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!7AEc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa08baf5e-9d9e-49e6-91aa-2315c0e1c868_4391x7139.png 424w, https://substackcdn.com/image/fetch/$s_!7AEc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa08baf5e-9d9e-49e6-91aa-2315c0e1c868_4391x7139.png 848w, https://substackcdn.com/image/fetch/$s_!7AEc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa08baf5e-9d9e-49e6-91aa-2315c0e1c868_4391x7139.png 1272w, https://substackcdn.com/image/fetch/$s_!7AEc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa08baf5e-9d9e-49e6-91aa-2315c0e1c868_4391x7139.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p></p><p></p><h2>Google NoteBookLM Briefing</h2><h2><strong>Executive Summary</strong></h2><p>To effectively harness the business value of Large Language Models (LLMs) and AI agents, enterprises must fundamentally redesign their data systems. The current paradigm, built for direct human interaction via structured queries, is ill-suited for the speculative and exploratory nature of AI agents, leading to significant inefficiency, waste, and prohibitive costs. Agents do not simply query data; they &#8220;probe&#8221; and investigate, a process that generates redundant queries and consumes vast resources if left unmanaged.</p><p>A paradigm shift is required, moving toward an architecture that is inherently AI-friendly. This involves three core transformations:</p><ol><li><p><strong>Developing Multimodal Query Interfaces:</strong> Systems must evolve beyond SQL-only interaction to support both structured queries and natural language, providing the rich, multimodal communication channel that agents require.</p></li><li><p><strong>Integrating Context as a First-Class Citizen:</strong> The historical separation of data and its descriptive context must end. 
Future systems need to store and surface rich metadata, business definitions, and operational knowledge directly alongside the data, providing the essential &#8220;grounding&#8221; that agents need to perform accurately and efficiently.</p></li><li><p><strong>Implementing Intelligent Query Management:</strong> Data platforms must become active participants in the query process, capable of determining which agent-generated probes are necessary and which are wasteful, thereby preventing redundant execution and controlling costs.</p></li></ol><p>This briefing document synthesizes these critical insights, outlining the limitations of traditional systems and presenting a blueprint for the next generation of data architecture designed to convert AI&#8217;s speculative power into tangible business speed and value.</p><p><strong>1. The Imperative for Change: From BI to Foundational Data Engineering</strong></p><p>The evolution of data roles provides a crucial lens for understanding the need for architectural change. The journey of Mayowa Oludoyi from a Business Intelligence (BI) Analyst to a Data Engineer highlights a critical realization: the ultimate value of any data product, be it a dashboard or a machine learning model, is contingent upon the quality and structure of the underlying data.</p><ul><li><p><strong>Motivation for the Shift:</strong> The transition was prompted by the understanding that &#8220;everything still come back to the data.&#8221; A significant portion of time in analytics and machine learning is spent on transforming and cleaning data, suggesting that more focus should be placed on the foundational data processes rather than solely on the end product.</p></li><li><p><strong>Challenges in Skill Acquisition:</strong> This career transition was not a straightforward path. The primary difficulties encountered were:</p><ul><li><p> <strong>Lack of Mentorship:</strong> Finding a mentor to provide structured guidance&#8212;explaining the &#8220;why&#8221; behind learning paths and the nature of real-world problems&#8212;proved difficult. Most learning had to be self-directed through courses and books.</p></li><li><p> <strong>Absence of a Structured Curriculum:</strong> Initial learning was described as &#8220;a little bit scattered&#8221; due to the lack of clear, process-oriented roadmaps. While tool-focused roadmaps (Spark, DBT, etc.) exist, the more valuable curricula focus on fundamental processes and principles, as &#8220;tools we always change... it is a means to an end.&#8221;</p></li></ul></li><li><p><strong>The Value of Mentorship:</strong> A valuable mentor is not one who teaches coding, but one who provides structure and context. This includes explaining the types of problems data engineering solves, the reasons for solving them, and the logical sequence of learning (&#8221;you need to learn A, before you learn B&#8221;). This contextual understanding is often missing from online courses and is essential for career progression.</p></li></ul><p><strong>2. The Core Problem: Why Traditional Data Systems Fail AI Agents</strong></p><p>The central argument for redesigning data systems is that they were built for a different user: a human executing precise, structured commands. 
AI agents interact with data in a fundamentally different manner, and this mismatch creates significant friction and waste.</p><p><strong>Defining the &#8220;Traditional Data System&#8221;</strong></p><p>For this analysis, a traditional data system is one where the primary interaction model involves a user leveraging a structured language (e.g., SQL) to execute a query against a database (e.g., Postgres) and receive a direct result.</p><p><strong>The Nature of AI Agents vs. Humans</strong></p><p>The key distinction lies in the method of inquiry. Humans query, but agents investigate.</p><ul><li><p><strong>Speculation and Exploration:</strong> Unlike a human who formulates a specific SQL statement, an agent must often &#8220;speculate&#8221; to find an answer. It performs exploratory work, probing the data system with a series of queries to build understanding. This process is inherently iterative and less direct.</p></li><li><p><strong>Probing vs. Querying:</strong> The interaction is better described as &#8220;probing&#8221; rather than querying. The agent is investigating the data landscape, which is fundamentally different from a human retrieving a known piece of information. As stated in the discussion, &#8220;agents don&#8217;t just query, they probe.&#8221;</p></li></ul><p><strong>The Consequence of Misalignment: Waste and Inefficiency</strong></p><p>Simply layering an AI agent on top of a traditional data system introduces massive inefficiencies that businesses cannot afford.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0LhY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b020eae-a97b-4b20-b3bb-7a2d5bed8b25_831x228.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0LhY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b020eae-a97b-4b20-b3bb-7a2d5bed8b25_831x228.png 424w, https://substackcdn.com/image/fetch/$s_!0LhY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b020eae-a97b-4b20-b3bb-7a2d5bed8b25_831x228.png 848w, https://substackcdn.com/image/fetch/$s_!0LhY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b020eae-a97b-4b20-b3bb-7a2d5bed8b25_831x228.png 1272w, https://substackcdn.com/image/fetch/$s_!0LhY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b020eae-a97b-4b20-b3bb-7a2d5bed8b25_831x228.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0LhY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b020eae-a97b-4b20-b3bb-7a2d5bed8b25_831x228.png" width="728" height="199.74007220216606" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b020eae-a97b-4b20-b3bb-7a2d5bed8b25_831x228.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:228,&quot;width&quot;:831,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:45681,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/184771911?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b020eae-a97b-4b20-b3bb-7a2d5bed8b25_831x228.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0LhY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b020eae-a97b-4b20-b3bb-7a2d5bed8b25_831x228.png 424w, https://substackcdn.com/image/fetch/$s_!0LhY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b020eae-a97b-4b20-b3bb-7a2d5bed8b25_831x228.png 848w, https://substackcdn.com/image/fetch/$s_!0LhY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b020eae-a97b-4b20-b3bb-7a2d5bed8b25_831x228.png 1272w, https://substackcdn.com/image/fetch/$s_!0LhY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b020eae-a97b-4b20-b3bb-7a2d5bed8b25_831x228.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>The Context Gap</strong></p><p>Traditional systems are designed to store data, not the rich context surrounding it. Agents, however, are critically dependent on this context for &#8220;grounding&#8221;&#8212;the ability to understand the data&#8217;s meaning, relevance, and structure. While data catalogs have historically attempted to solve this, they often failed due to the high manual effort required to populate them. With AI, this descriptive, natural language context is no longer a &#8220;nice-to-have&#8221; but a core requirement for system performance and accuracy.</p><p><strong>3. Architecting the Future: A Blueprint for AI-Ready Data Systems</strong></p><p>To address these shortcomings, a new architectural approach is necessary. This approach redefines the interface, integrates context as a core component, and introduces intelligent oversight of the query process.</p><p><strong>3.1. A New Multimodal Query Interface</strong></p><p>The first step is to create a different query interface that serves both humans and machines effectively. This interface must be multimodal, accommodating both:</p><ul><li><p><strong>Structured Query Language (e.g., SQL):</strong> For precise, efficient data retrieval.</p></li><li><p><strong>Natural Language:</strong> To allow agents to process and leverage the rich descriptive context needed for grounding and reasoning.</p></li></ul><p>This dual capability allows the system to support traditional analytics while simultaneously providing the necessary foundation for advanced agent-based interactions.</p><p><strong>3.2. 
Integrated Context as a First-Class Citizen</strong></p><p>Context must be elevated from an afterthought in a separate catalog to an integrated component of the data platform. This can be achieved by providing agents with access to various forms of grounding material:</p><ul><li><p><strong>Past Queries and Code:</strong> Providing an LLM with a repository of previously written, successful SQL queries serves as powerful, practical context. Mayowa noted this technique works &#8220;perfectly&#8221; for his personal use, allowing the LLM to generate new queries based on established patterns.</p></li><li><p><strong>Code with Embedded Explanations:</strong> The most effective context combines structured code with natural language explanations. In one example, an agent&#8217;s performance improved dramatically when it was given access not only to a repository of transformation rules (code) but also to comments within that code explaining <em>why</em> each rule was created and what business purpose it served. This provides both the &#8220;how&#8221; (the code) and the &#8220;why&#8221; (the context).</p></li><li><p><strong>Centralized Context Stores:</strong> The system architecture must include a mechanism, potentially a &#8220;meta store&#8221; or a new function within the database itself, to store and serve this context universally. This ensures that any agent interacting with the system has access to the same grounding information, promoting consistency and accuracy.</p></li></ul><p><strong>3.3. Intelligent Query Management</strong></p><p>To combat waste, the data system must evolve from a passive recipient of queries to an active manager of them. The system itself should have the intelligence to &#8220;determine what query or probe needs to be executed.&#8221; By leveraging its own metadata, the system can identify and prevent the execution of irrelevant or redundant queries generated by an agent&#8217;s speculative process, thereby preserving resources and controlling costs.</p><p><strong>3.4. The Rise of Specialised Agents</strong></p><p>An effective strategy for managing complexity and improving results is to move away from a single, generalist agent. Instead, a team of specialized agents, each with a &#8220;bounded context,&#8221; can be deployed. For instance:</p><ul><li><p>An <strong>&#8220;ADI the data modeler&#8221;</strong> agent focuses solely on data modeling tasks and is given context specific to that domain.</p></li><li><p>An <strong>&#8220;ADI the engineer&#8221;</strong> agent handles data transformation rules.</p></li><li><p>An <strong>&#8220;ADI de boss&#8221;</strong> agent orchestrates the workflow, passing tasks to the appropriate specialist agent.</p></li></ul><p>This approach mirrors how human expert teams function and has been shown to produce significantly better and more reliable results by ensuring each agent operates within a well-defined and deeply contextualized skill set.</p><p></p><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? 
The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><p></p><h2>Transcript</h2><p><strong>Shane</strong>: Welcome to the Agile Data Podcast. I&#8217;m Shane Gibson.</p><p><strong>Mayowa</strong>: Hi, this is Mayowa joining from Nigeria.</p><p><strong>Shane</strong>: Thanks for coming on the show. Today we&#8217;re gonna talk about redesigning current data systems to power LLMs and agents. But before we do that, why don&#8217;t you give the audience a bit of background about yourself.</p><p><strong>Mayowa</strong>: Thank you once again. Again, let me just say thank you for bringing me to the podcast and, like I mentioned earlier, my name is Mayowa. I started off my career working as a business intelligence analyst. And honestly, I still feel it&#8217;s still one of the most interesting roles anybody can get involved in. It&#8217;s very interesting. And while working as a business intelligence analyst, this kind of gave me the opportunity to work on several projects. I started off working in consulting and so that kind of gave me the opportunity to work on several projects.</p><p>We have clients working in telecommunication, in banking, in health tech. And so, starting off that way gave me enough opportunity to experience different projects. And while working, I had the opportunity to work on several projects, like working on data warehousing, building data reporting, Power BI, Tableau and all of those interesting data projects. But then as time goes on, I moved on to a new company after two years. And then this new company, it is just basically a payment company that is in FinTech. And then, while I was there, I did a lot of work around analytics, data engineering, and that was where my interest in data engineering started evolving, and from there I left the company after another two years, and then I moved into a bank. And then this time around I started working as a data engineer. And while working as a data engineer, a bank has a lot of legacy systems. There&#8217;s a lot of things, a lot of moving parts when it comes to data engineering in a bank. You talk about privacy, you talk about building systems that are highly protected, all of those things.</p><p>So while in the bank I started having this interest in how to actually build robust pipelines. So I wanted to move on from the normal building of ETL. And so I started learning and that was how my career has spiraled all through the years. And yeah.</p><p>Here we are today.</p><p><strong>Shane</strong>: That move from being a business intelligence analyst focusing on the reporting, Tableau, Power BI, I&#8217;m assuming the data was served to you, so you grabbed the data you needed and you focused on visualization and user experience,</p><p>And then moving back into the engineering side, into the code that transforms the data, collects it, all that kind of gnarly stuff, that&#8217;s a common path for people.</p><p>How did you find it? How did you find changing from a set of tools and a certain set of skills to then expanding them out into new tools and new skills? 
Was that an easy journey or did you find that step change quite difficult?</p><p><strong>Mayowa</strong>: I&#8217;ll say for me it was not that easy a journey. The reason being that the resources are there online, but most times the resources are not what translates to business value, right? So sometimes you have to struggle to get things done. So it was not an easy journey. But one thing that I feel prompted this shift in me was that at the end of the day I discovered that one of the most important processes still remains the data. You can have beautiful dashboards, you can have machine learning models that are doing fantastic. But everything still come back to the data. So I started asking myself the question. I think I spend a lot of time transforming data, cleaning data, whether to produce a report or to do machine learning. And so I feel like it&#8217;s important to spend a lot of time around the data itself than even the product. I mean, the end product is really important, because that is the whole point. But then I feel like it&#8217;s important to spend more time around the data. So that was what prompted it, but then I had to look for a way to attend conferences, take courses online, and then that shift was not easy.</p><p>But gradually I started building all the necessary experience that I need. Yeah.</p><p><strong>Shane</strong>: And so often I&#8217;ll see people that make that change from being a BI-centric set of skills to that data engineering set of skills. Often they&#8217;re in an organization that actually encourages that change. So they&#8217;re in an organization, they&#8217;ve got a bunch of mentors, a bunch of people who&#8217;ve done it before and they can get a lot of help and mentoring about the things they need to learn and a lot of feedback.</p><p>In some organizations that&#8217;s not possible. So people will go and study on the side, they&#8217;ll go and find courses and they&#8217;ll try and learn it to help them then effectively change organizations to change roles. Which way did you go? Did you find that you had mentoring and support within an organization to do the step change?</p><p>Or did you have to learn outside the organization you&#8217;re in and then change jobs to go into the new role?</p><p><strong>Mayowa</strong>: Honestly I couldn&#8217;t find a mentor. I had to take the other route. I went all out to get information. I have a couple of people who tell me exactly, oh, this is what you need to do. You can take this course. I have those people, I have people tell me that, but I didn&#8217;t have the opportunity of having someone who will hold my hand and show me everything I need to know. So most of the time I had to go out there myself, pick up courses, pick up books to read and that was it. So that was the path that I took. But I understand what you&#8217;re saying. 
There are a lot of people who have the opportunity of having a senior data engineer on the team who can tell them what to do, and then they learn through working on different tasks, different projects along the way, but I didn&#8217;t have the opportunity.</p><p><strong>Shane</strong>: And if we think about the idea of a curriculum, an idea of these types of courses, these types of tools, these types of skills, and learning them potentially in this order makes sense, so almost like a manifest or a curriculum of what you might want to do, rather than having to try and find it all yourself.</p><p>Did you find that, did you find that there were resources out there that gave you the idea of a bootcamp? Or did you have to go and talk to lots of people and cobble it together yourself to figure out how you went through that path?</p><p><strong>Mayowa</strong>: Basically, my knowledge was a little bit scattered at the very start because there was no tailored curriculum to give me more of a roadmap. These days I see a lot of roadmaps here and there, and I wish I had this when I was starting, right? But then when I started, I didn&#8217;t have this roadmap.</p><p>I had to learn. So at the end of the day, I have a lot of knowledge and it took a lot of time, a lot of learning, to put everything together to make sense actually. But I think that problem has been solved. I think I&#8217;ve seen a lot of resources online, a lot of roadmaps that I think are doing justice to that now.</p><p><strong>Shane</strong>: Yeah. And those roadmaps, are they primarily technical and tool-based, or are they about the process and the ways of working as much as they are about the technologies? Because again, when I see, if I look at job ads or I look at what people say they do, I&#8217;ll hear DBT, Spark, Databricks, I hear tools and technology versus the skills and the tasks that, you need technology to deliver them, but they&#8217;re just as important as the technology itself.</p><p>So what did you find?</p><p><strong>Mayowa</strong>: I can share this with you at the end of the session because I don&#8217;t wanna mention names. There&#8217;s this particular roadmap that I&#8217;ve seen, which I think is very good. When you look at the roadmap, they didn&#8217;t spend a lot of time talking about tools.</p><p>Tools can change, but they spend a lot of time around processes, which I think is really great. Things, fundamentals, that I think everybody should know when it comes to data engineering. And even though, that domain I think is really good. And I share the same opinion. I see a lot of roadmaps where they talk about Spark, DBT, I don&#8217;t think that is the best. I think it&#8217;s because tools always change. Tools are a means to an end. It&#8217;s not an end in itself. So I feel like a roadmap that talks about the real process, the real principles,</p><p>I think they&#8217;re very great and I think that is what will help a lot of people to move up quickly.</p><p><strong>Shane</strong>: We met each other in the Practical Data community and it&#8217;s still one of my most, my favorite, most active communities I&#8217;m in. And yeah, I think something we should look to do is creating that roadmap or that curriculum as an open source roadmap and curriculum as part of that community, because lots of people have that struggle when they&#8217;re trying to get into the domain or to jump across roles.</p><p>Where do you start? 
And so last question before we get onto the current data stuff, because this whole onboarding of a person into a new career really intrigues me. So when you talk about, if you had found a mentor who would&#8217;ve helped you through that process,</p><p>What would you have expected them to do with you? What would you want them to help with?</p><p><strong>Mayowa</strong>: Sorry, can you repeat that again?</p><p><strong>Shane:</strong> So you talked about the fact that you were looking for a mentor and you struggled to find one.</p><p>Say you had found one, let&#8217;s say somebody had said, yep, happy to help you take that next step in your journey. What would&#8217;ve been valuable for that mentor to do? What would you expect them to actually do to help you progress your career?</p><p><strong>Mayowa</strong>: I think one major thing I look out for in a mentor, it&#8217;s not to teach me maybe how to write code, but rather just to show me exactly what I need to succeed, and I&#8217;ll explain what I mean. So for example when I was trying to break into data engineering, it was difficult to actually see somebody who has the experience of what happens in data engineering. There are a lot of resources on how to write SQL, Python and all those things, right? But I need somebody who works in data engineering who tells me, this is the kind of problem we solve. This is why we need to solve this problem. This is why you need to learn it, that kind of structure. That is what I&#8217;m looking for, and I don&#8217;t think it&#8217;s something that is readily available in the market, or I&#8217;m not even sure most courses offer that, even when they offer these courses online. So I think for me personally, what I&#8217;m looking for in a mentor is somebody who is not just interested in, learn these skills, but tells you exactly the structure of learning. All right? This is important. For example, you need to learn A before you learn B, you need to learn B before you learn C, something like that. That is what I&#8217;m looking for in a mentor.</p><p><strong>Shane</strong>: I think that&#8217;s the important part. So I mentor people every now and again, and I know a bunch of people that want to mentor but aren&#8217;t sure what the process is. And they often think it is more teaching. They think they&#8217;re gonna have to spend hours teaching somebody how to do something. And if you&#8217;re in an organization where you are mentoring somebody in your team, a junior, then yes you probably will do that.</p><p>But if you are external to that organization, to that person, to me it&#8217;s about connecting with somebody, listening to where they&#8217;re at. Potentially making a suggestion about what they might want to try next and why. It&#8217;s about storytelling; one of the things I find is if you can tell a story from your career, that I did this and this is what happened and this is why it happened, that&#8217;s valuable.</p><p>And then the last one is context. When somebody says, okay, why does everybody say you shouldn&#8217;t do real time streaming? The context is, it&#8217;s expensive still to do that versus batch. Nine times outta 10, they don&#8217;t need the data straight away. They can live with 15 minutes or an hour. So everybody says they want it, but actually when they see the cost of it, they say they don&#8217;t.</p><p>That&#8217;s the context. And you tend not to find that story of the context of why does everybody say don&#8217;t do near real time? 
Unless you go and ask, it is not well written down somewhere. And I definitely don&#8217;t often see it as part of the course, so yeah, I&#8217;m with you.</p><p>It really is an hour, maybe, a week, having a chat with somebody and just helping them with their career. And so for me I think about all the people that helped me, all the people that helped and mentored me in my career, I think people who have been doing it for a while need to pay it back.</p><p>So anyway, end of pitch for people to mentor more people in the world. Okay. So let&#8217;s talk about this idea of redesigning the current data systems to make them better for LLMs and AI. So when you kind of propose that as a subject, talk me through it. What do you mean by that?</p><p><strong>Mayowa</strong>: I started thinking about the whole LLM thing some few months back. As a matter of fact, just like I told you, I have an article I want to release, in which I just wanted to talk about the traditional data systems and what we need to get to that point where enterprises can take advantage of what LLMs are offering right now. And this started off from working, and everybody&#8217;s been talking about LLMs since 2023, the whole buzz, and there&#8217;s so much happening with LLMs these days. So I ask myself, on the job right now I use ChatGPT, I use Claude for a lot of things, but I ask myself, how does the enterprise, how does the business still get value from these LLMs at the end of the day? So that was what led to that thinking. And then I started exploring and I&#8217;ve seen very great articles and research papers around that. So for me, the management team here or where I work, they&#8217;re not just interested in hype, they&#8217;re interested in value, so before they put money into any of these things, they want to know how much value they can get from it.</p><p>So sometimes it&#8217;s difficult to get buy-in from the management about some of these things unless they see what they can get from it. So that was why I started talking about, I started thinking about how we can bring value to the enterprise through LLMs. But then I discovered that I don&#8217;t think that value can be created immediately with the current systems that we have, if at all there&#8217;s gonna be a value. When I say value I&#8217;m not saying, we&#8217;ve seen agents doing interesting things, like a lot of people integrate agents into their GitHub repository.</p><p>That is good for the developer, but when it comes to the business, I don&#8217;t think there&#8217;s so much value that they&#8217;re getting from that. So I think for us to get to that point, there&#8217;s gonna be a need for us to rethink some of the processes and even the data system itself. So that was how I got to that topic, to that theme that I gave you.</p><p>Yeah.</p><p><strong>Shane</strong>: Okay and so there&#8217;s a couple of ways we can approach this. We can look at what you would see as a traditional data system and describe that, and then figure out what you think should change, or we could just identify some things, people, processes, technology design, that you think when you look at that, that needs to change for the use of LLMs and to add value back to stakeholders.</p><p>So I&#8217;m with you. We see lots of use cases of developers using copilots and LLMs to automate or make their jobs easier, remove that drossy work. 
We see some interesting use cases using those tools to automate business processes and data.</p><p>And then there&#8217;s this third kind of dimension, which is agents using data for stakeholders.</p><p>It&#8217;s not business process automation, and it&#8217;s not co-pilots for developers.</p><p>And now everybody goes, oh, it&#8217;s text to SQL, right? Ask a question, get an answer. That&#8217;s the obvious use case. So which way do you wanna do it? Do you wanna describe a legacy system and then what you would change, or do you want to talk about specific use cases that you think are valuable to a stakeholder?</p><p>And then we talk about it from that lens.</p><p><strong>Mayowa</strong>: I think we can talk about a bit of the two. When I was trying to just put some points together before joining, I actually had three questions that I feel are important for us to answer.</p><p>One of the most important questions is why do we need to redesign the data system? Why is it that we can&#8217;t take advantage of the current system? I&#8217;m talking about the traditional systems, right? I think that is very important, and then we need to answer the question, why can&#8217;t the current system be leveraged as it is? Because there are two different things. One thing is that we need to answer the question why we need to redesign the data system. We need to talk about why can&#8217;t the current system be used as it is? Maybe we can even add on top of that, what does the traditional data system look like?</p><p>Maybe somebody&#8217;s listening to this conversation and does not even know what the traditional data system looks like. So we might even want to touch on that. And I think the final question that I think we should try to provide an answer to is how then do we redesign the system that we feel, or that we think, will be able to get us to that level where we can get massive gains from LLMs and agents.</p><p><strong>Shane</strong>: Let&#8217;s do that. One of the things that we need to always recognize is we need to anchor our language in a way that everybody else gets a shared language that we are using. So when you say traditional data system, I have in my head what I think a traditional data system is, and it&#8217;s probably not the same as yours.</p><p>Because mine&#8217;s based on 30 years ago, 20 years ago, 10 years ago, there&#8217;s been broad iterations of traditional data systems. What I always suggest people do, right, is when they&#8217;ve got a mental picture in their head, they&#8217;ve got a map, it&#8217;s always useful to describe the map in a way that everybody goes, oh, that&#8217;s what you mean. But let&#8217;s start off, what do you wanna do? Do you wanna do, why do we need to redesign traditional data systems? Or why can&#8217;t we leverage them for LLMs? Take me away.</p><p><strong>Mayowa</strong>: I think we should start with the why we need to redesign the data system. I think one thing that is really important is that we need to know that LLMs or agents, whichever way you want to call it, they&#8217;re different from humans, right? If you have used agents or if you&#8217;ve used LLMs, whether it&#8217;s ChatGPT or Claude, whichever one, and then you use it around data, you will know that they don&#8217;t just ask once and then analyze. There might be a need for you to ask again, change some things. And so because of that, in an attempt to provide an answer, these agents or LLMs speculate. 
That means they try to do some form of exploratory work, so that makes them different from humans.</p><p>Because they&#8217;re depending on whatever it is that you give to them for them to think, that in itself tells you immediately that they behave differently from humans. And so the natural way we get information from a data system cannot be the same way that these agents get information for us.</p><p>I don&#8217;t know if that makes sense.</p><p><strong>Shane</strong>: Yeah, so let me just play it back and see if I understand what you&#8217;re saying. So if we look at the LLMs, the foundational models, they&#8217;ve been trained on large knowledge bases of text.</p><p>And so in the early days, we&#8217;d ask &#8216;em a question, they would search that knowledge base and they&#8217;d come back with an answer.</p><p>And all we did was we&#8217;d ask a question, it would look into itself like a human would if it only could look at its memory. So it would go, what do I know about that? Here&#8217;s an answer. And then what we found was actually we wanna provide context to it. We wanna provide some additional information that the foundational model may not have or may not have brought to the front of its memory.</p><p>And so we started seeing techniques like RAG and prompt engineering, ways of giving it additional instructions on how we want it to behave. Because the way my dentist behaves is different to the way a data analyst behaves. The prompts were: behave like a dentist, behave like a data analyst.</p><p>And then the second thing is, here&#8217;s some other information you probably don&#8217;t have. That, again, a human would ask. If you said to me, how many customers have we got? I&#8217;m probably gonna say to you, what&#8217;s your definition of a customer and where do we hold the data about a customer so I can go look at it and count it for you?</p><p>And you want the count today, right? Not last week or next month. And it&#8217;s okay if the count is out by one, so 1,001 customers is okay. Yeah. So there&#8217;s a bunch of questions that, as a human, you would ask if you&#8217;re an expert in your domain. And so this idea of context and reinforcement is passing that back.</p><p>So we pass the LLM prompts to tell it how to behave, the data we want it to look at if it doesn&#8217;t have access to it already. So we give it access, and then the context, additional information we know is valuable to it. So is that what you mean by the change in what we have to provide, versus a SQL statement, which is select this from there, and all we gotta do is give it the code and the data and the machine runs that code on that data and we&#8217;re done?</p><p>Nothing else needs to happen. We don&#8217;t need to tell it anything else. Just run this code.</p>
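<p><em>To make that concrete, here is a minimal sketch of the kind of grounding Shane describes: a persona prompt plus business definitions and table hints assembled before the question goes to an LLM. The table names, definitions and the build_prompt function are illustrative assumptions, not part of any specific product.</em></p><pre><code># A minimal sketch of grounding a text-to-SQL request with persona and context.
# All names and definitions here are illustrative assumptions.

PERSONA = "You are a careful data analyst. Answer only from the tables listed."

BUSINESS_CONTEXT = {
    "customer": "A person with at least one completed order; counted as at today.",
    "customer table": "analytics.dim_customer holds one row per customer.",
}

SCHEMA_HINTS = [
    "analytics.dim_customer(customer_id, customer_name, first_order_date)",
    "analytics.fct_orders(order_id, customer_id, order_date, order_total)",
]

def build_prompt(question):
    """Assemble persona, definitions and schema hints ahead of the question."""
    context_lines = [f"- {term}: {definition}" for term, definition in BUSINESS_CONTEXT.items()]
    schema_lines = [f"- {hint}" for hint in SCHEMA_HINTS]
    return "\n".join(
        [PERSONA, "Business definitions:"]
        + context_lines
        + ["Tables you may query:"]
        + schema_lines
        + ["Question: " + question, "Respond with a single SQL statement."]
    )

print(build_prompt("How many customers have we got?"))
</code></pre>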
<p><strong>Mayowa</strong>: I think the key words, you rightly explained it, but I think the keyword there is that because LLMs are going to be, or let me say agents, let&#8217;s just use agents, all right? Because agents are gonna be one of the ways we interact with data, when it comes to LLMs, agents are gonna be like the main use case, right? So I think the key word is that since agents are not human, there&#8217;s a higher chance that in an attempt to provide an accurate response to your queries they speculate. That means they spend a lot of time doing some exploratory work, and that in itself has an impact on the data system. So because there&#8217;s gonna be a lot of redundant queries, things that are not necessary that you will need to fine tune and all of those things. And so because of that I think, just like you said, they have been trained on this massive text that might not even be relevant to what you&#8217;re saying.</p><p>But then they have to return something back for you. So I think the key word here is that they speculate, and in all of these, it&#8217;s an attempt to just give you a response.</p><p><strong>Shane</strong>: Okay so again if I play it back, if we took human behavior, it&#8217;s like me saying to you, here&#8217;s my one terabyte data warehouse, go and tell me how many customers I&#8217;ve got. But by the way, there&#8217;s no table called customer.</p><p><strong>Mayowa</strong>: Yeah.</p><p><strong>Shane</strong>: And now you&#8217;ve gotta do a whole lot of exploration, right?</p><p>You&#8217;ve gotta go and try and figure out what&#8217;s a customer called? What table does it live in? How&#8217;s it defined? And you&#8217;re gonna go and do all this work. And that work takes human time and it takes compute from the system because I&#8217;m gonna be writing queries. I think what you are saying is with the LLMs it&#8217;s the same, if we say to them, here&#8217;s a big blob of data, with no context and no constraints, and go answer this question. &#8216;Cause with the reasoning models, and reasoning in quotes, it&#8217;s going through and building itself a curriculum. It&#8217;s building a manifest. It&#8217;s saying, I&#8217;m gonna go try that.</p><p>That didn&#8217;t work. I&#8217;m gonna go try that. That didn&#8217;t work. And it&#8217;s trying lots of things. And because it&#8217;s a machine, yes, we sometimes see it tell us what it&#8217;s doing. But it&#8217;s doing a whole lot of work under the covers. And one of the other things about that is token cost. Because every time it does a task, it&#8217;s using a token that costs us money.</p><p>If we change the way our data platforms are structured, we can make the LLMs more efficient, more effective, and require them to do less thinking because we&#8217;re giving them hints of what they should use and when. Is that the angle you&#8217;re taking?</p><p><strong>Mayowa</strong>: Yeah, correct. The fact that you mentioned the cost thing makes me remember some of the reasons, so I wrote here, you pay for waste. And this is one of the reasons why I feel like we need to redesign this, because enterprises, businesses don&#8217;t want to pay for waste. Redesigning this system is how we convert this speculation to speed. Because it&#8217;s important for us to know that businesses, whether small or large, don&#8217;t in any way want to allow waste. So it&#8217;s important for us to know that agents don&#8217;t just query, they probe, and they continue to, and there is a lot of redundancy at the end of the day if we don&#8217;t get this system right, because it is in the nature of an agent to probe instead of just getting that.</p><p>Yeah. So yeah you&#8217;re absolutely correct. Cost is a very important one.</p>
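<p><em>As a rough illustration of why that waste matters, a back-of-the-envelope sketch of what speculative probing costs at scale; every number below is a hypothetical assumption, not a figure from the episode.</em></p><pre><code># Back-of-the-envelope probe cost, with purely hypothetical prices and volumes.
tokens_per_probe = 4000          # assumed prompt plus response tokens per exploratory query
probes_per_question = 15         # assumed probes an agent runs before it settles on an answer
questions_per_day = 500          # assumed agent questions across the business per day
price_per_million_tokens = 5.0   # assumed blended dollar price per million tokens

daily_tokens = tokens_per_probe * probes_per_question * questions_per_day
daily_cost = daily_tokens / 1_000_000 * price_per_million_tokens
print(f"tokens per day: {daily_tokens:,}")
print(f"cost per day:   ${daily_cost:,.2f}")
print(f"cost per month: ${daily_cost * 30:,.2f}")

# If grounding and query gating cut probes per question from 15 to 3,
# the same questions cost one fifth as much.
gated_cost = daily_cost * 3 / probes_per_question
print(f"gated cost per month: ${gated_cost * 30:,.2f}")
</code></pre>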
<p><strong>Shane</strong>: And one of the things is at the moment we&#8217;re not paying the true cost of those tokens. We pay $20. Okay, now we have to pay 200 a month. Okay, now we&#8217;re getting some constraints from Claude and those tools where we can&#8217;t just run everything forever before we run out of our limit of usage.</p><p>But we&#8217;re still not paying the true cost. And so while we can do things that are lazy right now and they work and don&#8217;t cost us a lot, it&#8217;s gonna change, right? And so you&#8217;re gonna start getting the $30,000 bill, and we might as well start designing our systems now to be cost effective and efficient.</p><p>And I think the other one is speed. If you think back to your Power BI, Tableau days, there was a rule of how long you could let a user wait before that report rendered, it was a second or two, and then they start going, this is too slow. And then you think about how much engineering we put in to summarize the data or to give us that performance.</p><p>And now you go into an agent, an LLM, and you ask it a question, and then when it&#8217;s doing its reasoning models, it&#8217;s really interesting to watch how it just sits there and gives you some feedback. It&#8217;s working and it&#8217;s not dead, but it&#8217;s taking way longer to bring back count of customer, which I could have had on a Tableau dashboard in less than a second with a blob of code.</p><p>So we&#8217;re still experimenting with where these agents best fit. Okay. So what we&#8217;re saying is things have to be changed, because just throwing an LLM on top of your current data platform is not gonna be the most cost efficient, not gonna be the fastest, it is gonna create waste.</p><p>And that waste will hurt us. Okay.</p><p>What&#8217;s next?</p><p><strong>Mayowa</strong>: I think we can now talk about what the current data system looks like, right? What I term traditional, and like you said, traditional in this case maybe means several things to several people, right? But for me, traditional just means the current system where we are able to use languages like structured query language, SQL. So right now the way we work is we leverage platforms, databases, Postgres, you name it. All right. And then we just run a query and then we get a result, and that&#8217;s what the current system looks like, right? But then when you look at the way an agent works, I think the right word to use is that agents don&#8217;t just use queries. What they do is not just querying. What they do is more like investigating, probing. They want to probe. So we need systems that are beyond that. Of course, SQL is very important. Many times people say SQL is gonna be around for a long time, we&#8217;re always gonna use it, I see people learning SQL every day.</p><p>I still see a lot of universities teaching SQL in their curriculum. So I&#8217;m not saying SQL is going anywhere, but I&#8217;m saying we need systems that will accommodate other things like natural language. Because these agents will need much more than SQL, like we said, they would need context.</p><p>We need to provide some kind of context. So there might be, maybe there&#8217;s gonna be a need for us to have interfaces that allow for not just the structured query language, but then maybe natural language. So the current data system that we have does not have that yet. So there might be a need for us to incorporate these kinds of interfaces, to accommodate for these agents to perform better.</p>
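<p><em>A rough sketch of what that dual interface could look like, assuming a single entry point that accepts either a SQL string or a natural-language request and routes it accordingly; the function names and the translation step are illustrative stand-ins, not a real product API.</em></p><pre><code># A minimal sketch of one query interface that accepts SQL or natural language.
# run_sql and translate_to_sql are assumed stand-ins for real implementations.

def run_sql(sql):
    """Placeholder for executing SQL against the warehouse and returning rows."""
    return [("rows for", sql)]

def translate_to_sql(question, context):
    """Placeholder for an LLM call that turns a question plus context into SQL."""
    return "SELECT count(*) FROM analytics.dim_customer  -- derived from: " + question

def query(request, context=None):
    """Single entry point for humans and agents: SQL goes straight through,
    natural language is grounded with context and translated first."""
    looks_like_sql = request.strip().lower().startswith(("select", "with"))
    if looks_like_sql:
        return run_sql(request)
    sql = translate_to_sql(request, context or {})
    return run_sql(sql)

# A human and an agent can use the same interface.
print(query("SELECT count(*) FROM analytics.dim_customer"))
print(query("How many customers have we got?", context={"customer": "one row per customer in dim_customer"}))
</code></pre>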
<p><strong>Shane</strong>: In the past we&#8217;ve always had data catalogs.</p><p>So we&#8217;ve had this idea of a catalog that sat across the technical metadata, the technical context, the technical structure. So I have these tables, I have these columns, I have these values.</p><p>And then from a governance and stewardship point of view, we always knew that adding context into that catalog had value.</p><p>So that table holds a record for customer. That table holds names, therefore it&#8217;s PII. That is the table that you query if you want a single list of customers. These other ones, they don&#8217;t hold everything or they&#8217;ve got duplicates. There&#8217;s all this language, this descriptive stuff that was useful.</p><p>But what we learned was nobody ever does it. The cost of creating that context was way higher than the value in using it. And yes, we had big organizations where it got mandated and there were certain industries where you had to do it. And yes, we got big tools that automated some of it, but in my experience, data catalogs were the tool you bought and two years later you turned off or nobody used it.</p><p>I think with LLMs and agents, I think that those types of capabilities, where we bring this natural language context in as text, as description of information, is highly valuable. But I&#8217;m not convinced that data catalogs of old are the right place to do it. And the reason I say that is I think they&#8217;re disconnected.</p><p>They&#8217;re exhaust, they hoover up exhaust. So we hoover up some technical metadata and then we ask somebody, as an extra task, to go and add the business context, the operational context of how many rows are in there, what&#8217;s the data quality like. And they don&#8217;t even have this idea of a GenAI context, you can&#8217;t really, at the moment, store prompts or hints for an LLM, for an agent. I don&#8217;t think they were fit for purpose in the original days, but I don&#8217;t think they&#8217;re fit for purpose now. So I&#8217;m with you. I think that there will be a new paradigm coming out about how we hold this language, this description, this context against everything to do with data.</p><p>And then that is highly valuable to the agents. And then the way we capture it, to me, it&#8217;s gotta be done as close to the creation of the code or the data as possible. So data engineers who are doing the work have it in their brain. They know how they&#8217;re counting customers, I&#8217;m gonna create a dim, I&#8217;m gonna create a hub and sat.</p><p>I know the logic because I&#8217;ve asked, and I&#8217;ve done that work. They&#8217;re the ones that we have to make it really easy for them to capture that tribal knowledge into a place that the agent can go and find it. Or maybe we just have probes in our head and then the LLM could ask us. &#8216;Cause when you work in an organization, you probably do it.</p><p>You go, oh, I don&#8217;t know how we calculated that. And you go and ask Bob. And you know somebody in your team that you can go and ask and you get that tribal knowledge outta their head. I&#8217;m being facetious here, but maybe we&#8217;ll have probes in our head where the LLMs ask us.</p><p>Yeah. So is that what you mean, that idea of that really rich, descriptive stuff? Yeah.</p><p><strong>Mayowa</strong>: So like I said, in summary, I think for us to take advantage of these agents there&#8217;s gonna be a need for us to have a paradigm shift from the normal SQL against databases. 
There&#8217;s gonna be more things that are involved, like you mentioned, maybe catalogs that maybe need to be part of the data system itself.</p><p>Maybe the databases might have something that stores a lot of context in memory, just to provide more context. But I think one thing that is important for these agents is that they need grounding, they need to have a lot of information for them to act. Anyway, so the current data systems we have, have to evolve to the point where we have some of these components integrated into the system for us to take advantage of the agents.</p><p><strong>Shane</strong>: Let&#8217;s just take that comment around probing from the way that you meant it, not me putting a probe in my head for the LLM to get my knowledge. And the traditional way we&#8217;ve always done it is we&#8217;ve taken data from source systems and we&#8217;ve brought it into one place and we&#8217;ve typically put it in a database of sorts that, like you said, we can write SQL and we can get an answer.</p><p>And yes, we played around with NoSQL databases and Hadoop and a whole lot of other stuff, but we keep coming back to a database column that stores data in a certain way and allows us to use this standard language to get the data back out. Seems to be a valuable way of working. And then we&#8217;ve now got this introduction of MCPs where we can effectively create almost an API or a network port or an HTTP kind of endpoint for the LLMs to talk to a system and get a response. So I need this, give it back to me.</p><p>And it&#8217;s not writing SQL. It might talk to the MCP server and the MCP server might write SQL to talk to the database and get back and then hand it over.</p><p>So I look at it and I go, do we know that there&#8217;s a bunch of use cases where probing won&#8217;t work? And I can come to those in a minute, but why wouldn&#8217;t we just fundamentally change the way we work? So rather than grabbing the data from the source systems and putting it in one place, why wouldn&#8217;t we just expose these MCP services and allow the LLM agents to always just do a one and done query?</p><p>Have you looked at it as also actually removing the data warehouse? No, we&#8217;ll come to the use cases why that probably won&#8217;t work at the moment, but I&#8217;m just intrigued by that. It&#8217;s, just go away, ask the question, give the answer. If the context is bound around it, if the context is the thing we care about, that&#8217;s the pet.</p><p>And where the data lives really is the cattle. It&#8217;s an interesting architecture change to what&#8217;s been 40 years of my life. </p>
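<p><em>Here is a minimal sketch of the kind of endpoint being described, where an agent asks in natural language and a service behind it writes and runs the SQL. It is a simplified stand-in written as plain Python functions, not the actual Model Context Protocol specification or any AgileData interface.</em></p><pre><code># A simplified stand-in for an agent-facing query service (not the real MCP spec).
# The agent never writes SQL; the service translates, executes and returns rows.

def list_available_tables():
    """Placeholder: return the tables and descriptions the service exposes to agents."""
    return {
        "analytics.dim_customer": "one row per customer",
        "analytics.fct_orders": "one row per order",
    }

def answer(question):
    """Handle a natural-language request end to end."""
    tables = list_available_tables()
    sql = generate_sql(question, tables)      # assumed LLM-backed translation step
    rows = execute(sql)                       # assumed warehouse execution step
    return {"question": question, "sql": sql, "rows": rows}

def generate_sql(question, tables):
    return "SELECT count(*) FROM analytics.dim_customer"

def execute(sql):
    return [(1001,)]

print(answer("How many customers have we got?"))
</code></pre>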
<p><strong>Mayowa</strong>: So I think it is an interesting one actually, but of course we can talk about the reasons why that is not gonna be possible. But I think it doesn&#8217;t stop the fact that it makes us start looking at things in a different way, the way we used to work.</p><p>Like you said, I don&#8217;t think there&#8217;s anything stopping us from doing that. Even from hindsight you start thinking about the cost effect and a lot of things that might, but I think that there&#8217;s absolutely nothing stopping us from exploring that. Data warehouses today, and this is me just digressing, in my experience the whole idea behind the data warehouse, which is consolidating data from different sources and putting it in one place, when you look at it, if we ask ourselves this question, has it really solved the problem? Because I&#8217;ve seen places where there is still information that you can&#8217;t find in the data warehouse.</p><p>The warehouse is supposed to be the source of truth, but there is still information that you can&#8217;t find in the data warehouse. So if that is still a problem, what if we have, like you described, a sea of data where we can just throw these MCPs at it to get information?</p><p>I think it&#8217;ll be valuable, but then there will be a lot of things that go into making that happen. </p><p><strong>Shane</strong>: And a high chance of waste. Yeah. A high chance of wasted cost, slow. All those things that you talked about at the beginning that actually we need to focus on at the moment. I think whatever we designed would give us those problems.</p><p>But that&#8217;s at the moment, because we&#8217;re in the beginning of this whole new wave and so we haven&#8217;t really done the rethinking, the re-engineering of what it would look like.</p><p>So where&#8217;d you get to in terms of, if you started with a blank piece of paper, what would you do right now to build a data platform capability that is LLM and agent friendly?</p><p><strong>Mayowa</strong>: The first thing is that we need a different query interface. Just like I explained, we need a different query interface, and this interface should accommodate both natural language and also the structured query language, SQL, and the rest. This interface should accommodate that. But at the same time, I think to avoid some of the things that I&#8217;ve talked about, like redundancy, like waste and all of those things, I think there&#8217;s gonna be a need for us to have the data system being able to determine what query or probe needs to be executed, just to prevent that waste. Because even when we have this interface that accommodates several MCPs that allow agents to run different queries against the data system, it is important for us to also put some measures in place to know what query to execute. Because at the end of the day, whatever you get back, it&#8217;s not gonna be all useful.</p><p>So I&#8217;ll give an example. If you say something like, oh, give me the sales trend of, maybe, batteries in the United States. An agent is just probably gonna run a query against different web servers, different websites, and just print everything. It&#8217;s not all gonna be useful for you.</p><p>But when you take that into an enterprise environment where you need to manage costs, where you need to ensure that you maximize and optimize your queries, you&#8217;ll see that is gonna be a big problem. So for me, I think the first thing I will look at is that we need a different query interface.</p><p>And we need to find a way that the data system determines what query to execute. So for example, the database knows, if there&#8217;s a way to, of course there&#8217;s gonna be a need to implement maybe metadata storage and all of those things to give an idea of what currently exists within the data system. So that determines immediately this is what is available. So if you run a query that is not relevant to what is in the database, you just don&#8217;t execute it. So that is the way it is in my head right now. So we need a new interface. We need something within the data system to determine what query needs to be executed. That is the way I will start designing it.</p>
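<p><em>A minimal sketch of that kind of gate, assuming the platform keeps a small metadata catalog of its own tables and a log of recent probes: probes that reference unknown tables are refused, and probes already answered are served from the log instead of being re-executed. The names are illustrative assumptions.</em></p><pre><code># A minimal sketch of intelligent query management: gate agent probes against
# platform metadata and a log of recent results. Names are illustrative.
import re

KNOWN_TABLES = {"analytics.dim_customer", "analytics.fct_orders"}
recent_results = {}   # probe text mapped to rows already returned

def referenced_tables(sql):
    """Very rough extraction of table names following FROM or JOIN."""
    return set(re.findall(r"(?:from|join)\s+([a-z_\.]+)", sql, flags=re.IGNORECASE))

def gate_probe(sql):
    """Decide whether an agent probe should run, be served from the log, or be refused."""
    normalised = " ".join(sql.lower().split())
    if normalised in recent_results:
        return "cached", recent_results[normalised]
    unknown = referenced_tables(sql) - KNOWN_TABLES
    if unknown:
        return "refused", f"unknown tables: {sorted(unknown)}"
    rows = execute(sql)                      # assumed warehouse execution step
    recent_results[normalised] = rows
    return "executed", rows

def execute(sql):
    return [(1001,)]

print(gate_probe("SELECT count(*) FROM analytics.dim_customer"))
print(gate_probe("select count(*) from analytics.dim_customer"))   # served from the log
print(gate_probe("SELECT * FROM finance.secret_table"))            # refused
</code></pre>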
<p><strong>Shane</strong>: What&#8217;s interesting is this idea that when we move from humans to machine, we get infinite scale. So that example you used of, what is the sales trend of batteries in the US?</p><p>As a human, I have some natural constraints. I have natural constraints of time. I can&#8217;t spend two years going and finding that number.</p><p>I have a natural constraint of knowledge, right? I&#8217;m gonna Google it. I&#8217;m gonna find some websites, and then I&#8217;m probably gonna even run outta time or get bored, and I&#8217;m gonna stop. Whereas when we talk about the agents, it&#8217;ll scale. If you let it, it&#8217;ll search every website in the world.</p><p>If you let it, if you tell it to, it can run a thousand, I don&#8217;t know, what is it? CPUs for the computer, miles per hour for a car? What&#8217;s an agent cycle for number of human hours? So it can run a lot of human hours in a short amount of time, for a cost.</p><p>Anything we do that is a pattern based on a human constraint of I don&#8217;t have time, I can&#8217;t scale myself, that disappears. And we&#8217;ve gotta be really cognizant of that, or we&#8217;re gonna get massive waste again, like you say. I think the other thing is this idea of, if you look at human behavior, if you&#8217;re in a data team and you&#8217;re in a team of five to nine, you&#8217;re gonna find that normally you&#8217;ll have a data modeling expert, you&#8217;ll have a data collection, source system expert, you&#8217;ll have a data engineering expert, you have these experts, and you are naturally gonna go and ask the expert for some help. Hey, I am looking at this and I need to model it. And I know Data Vault is our standard. So can you just gimme help because I&#8217;ve never done it, I don&#8217;t do it very often. Or can you peer review it? And one of the things we&#8217;ve found as we&#8217;re building this out is that when we had a single agent, so our agent&#8217;s called ADI, when she was the only agent we used to flood her. So we would say, go and do this work. And she&#8217;s just time slicing her skills all over the place. And we always got back an okay response until we then broke her out into other agents.</p><p>So we have ADI the data modeler, ADI the engineer who goes and figures out the transformation rules, and then ADI de boss. And so we talk to ADI de boss and she knows about these other versions of herself. Then she goes, oh, okay, the next thing I need to do is model that data. And she talks to ADI the data modeler.</p><p>And what that means is we provide ADI the data modeler a bounded context. We give her very specific prompts around: you are a data modeler, you&#8217;re not an engineer, you&#8217;re not the boss, you&#8217;re not the BI person, this is your job. These are your skills. We give her a bounded subset of context. You can look at the data structures but you can&#8217;t go and do data quality tests.</p><p>Because we&#8217;re talking about, in this case, conceptual modeling. That&#8217;s not your job. Another ADI will then take what you&#8217;ve done and make it better or do her task. And what we find is that specialization of skills and then that handoff, which happens in a team but you don&#8217;t really see it, if you then programmatically do that with the agents, we seem to get better responses.</p>
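<p><em>A minimal sketch of that orchestration pattern, with one boss agent routing work to specialists that each carry a bounded persona and a bounded slice of context. The roles echo the ADI agents discussed here, but the prompts and routing logic are illustrative assumptions, not the AgileData implementation.</em></p><pre><code># A minimal sketch of specialised agents with bounded contexts and a boss that routes work.
# The personas and routing rules are illustrative, not a real product implementation.

SPECIALISTS = {
    "data_modeler": {
        "persona": "You are a data modeler. You only design conceptual models.",
        "context": ["naming standards", "existing business concepts"],
    },
    "engineer": {
        "persona": "You are a data engineer. You only write transformation rules.",
        "context": ["transformation rule patterns", "example change rules"],
    },
}

def call_llm(persona, context, task):
    """Placeholder for a real LLM call made with a bounded persona and context."""
    return f"[{persona.split('.')[0]}] handled: {task}"

def boss(task):
    """The orchestrating agent: pick the specialist whose skills match the task."""
    if "model" in task.lower():
        name = "data_modeler"
    else:
        name = "engineer"
    spec = SPECIALISTS[name]
    return call_llm(spec["persona"], spec["context"], task)

print(boss("Model the customer concept for the new source system"))
print(boss("Write a rule to unpivot the monthly columns into rows"))
</code></pre>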
<p>So I&#8217;m with you. I think this idea of what interface, what language, what set of skills, what persona can be bounded and then handed off all over the place, that&#8217;s, I think, where we&#8217;re gonna need to go. I still think SQL to query data, if we need to do that, is gonna be done.</p><p>However, we&#8217;ve had great success giving images to the LLMs, not data,</p><p>but it&#8217;s expensive. It&#8217;s wasteful. So if I give it a photo, an image screen capture of an Excel spreadsheet,</p><p>it&#8217;s gonna do really well with it. But it costs me a lot more tokens than if I give it the CSV.</p><p>So I think it&#8217;s, again, that trade off between the art of the possible right now and then the cost and waste of doing cool stuff. </p><p><strong>Mayowa</strong>: So I think part of the reason why I also mentioned that it&#8217;s important for us to have this interface that accommodates both natural language and also SQL is, right now, part of the way I work is I actually have a bunch of queries that I&#8217;ve written in the past.</p><p>And I just dumped them in my LLM, dumped a lot of them. And right now I just ask questions. And because this LLM already has the history of my queries that I&#8217;ve used, TPV, how to calculate TPV, how to calculate TPC and all of those things, I just say, hey, can you develop a query to, and it does that and then gives me the output.</p><p>So I think part of what you&#8217;re saying is providing context. There&#8217;s still gonna be a lot of work to be done around context, so giving queries that we&#8217;ve used in the past, or maybe anything that can provide more grounding, will actually help resolve some of these issues.</p><p>I&#8217;ve seen it work firsthand. It&#8217;s working perfectly for me. There are a lot of requests where I don&#8217;t spend my time developing the query anymore. My LLM just does that for me.</p>
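<p><em>As an illustration of the pattern Mayowa describes, a minimal sketch of a small query library, each entry paired with a plain-language note about what it calculates, assembled into context for the LLM before a new request. The metric names echo the TPV and TPC examples above, but the definitions and SQL are illustrative assumptions.</em></p><pre><code># A minimal sketch of past queries, plus plain-language notes, used as LLM context.
# The definitions below are illustrative assumptions, not real business rules.

QUERY_LIBRARY = [
    {
        "name": "tpv_daily",
        "definition": "TPV: total payment value, the sum of successful payment amounts per day.",
        "sql": "SELECT payment_date, sum(amount) AS tpv FROM payments WHERE status = 'success' GROUP BY payment_date",
    },
    {
        "name": "tpc_daily",
        "definition": "TPC: total payment count, the number of successful payments per day.",
        "sql": "SELECT payment_date, count(*) AS tpc FROM payments WHERE status = 'success' GROUP BY payment_date",
    },
]

def context_block():
    """Render the library as text an LLM can use as grounding for new requests."""
    lines = []
    for entry in QUERY_LIBRARY:
        lines.append(f"-- {entry['definition']}")
        lines.append(entry["sql"])
    return "\n".join(lines)

prompt = context_block() + "\n\nUsing the same style, write a query for TPV by month."
print(prompt)
</code></pre>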
<p>So, the exact same example as yours, of here&#8217;s a blob of code, this is a rule that we&#8217;ve used before. And we then created an MCP server that Ada could use to see those rules.</p><p>That&#8217;s when we broke off Ada the change rule agent.</p><p>And so all she does is, if you ask her how you can transform this data in natural language, she will go and search those rules. But she&#8217;s effectively searching a code repository that is highly opinionated and highly structured. The language is the same; what&#8217;s in the language is different. And the response back we get is so much better now.</p><p>And the thing we found though was, while we had the rule, the logic, the language of the rule, we had no context. We never wrote down why we were doing it. So as soon as we added that, a natural language description of: this rule is to take Stats New Zealand data, which is being summarized as columns, and unpivot it to rows so that we can use it later when we need a row per record.</p><p>Then she&#8217;s, ah, now she&#8217;s getting both. The same as you are saying: if you take your blobs of code that you use on a regular basis, and I think when you talk about TPV and TPC you are saying there&#8217;s some metrics in there, if you then, in that code, put in the definition of the metric, TPV equals and then a bunch of text, as a comment in that code, then the LLMs are gonna get both a structured piece of code and context.</p><p>They&#8217;re gonna get SQL and they&#8217;re gonna get natural language. And I think that&#8217;s where we&#8217;ll end up. </p><p><strong>Mayowa</strong>: Going back to your question, if you ask me what are the things I need to think about if I&#8217;m going to design what this system should be, I think the interface is gonna be very important, and also how to manage what query needs to be executed, to avoid waste and redundancy. I think that is the way I&#8217;m gonna think about it.</p><p><strong>Shane</strong>: So let&#8217;s take that example where you&#8217;ve got blobs of code that you have found highly valuable, and you&#8217;ve already tested putting that code in as reinforcement, giving that context of the code to an LLM and being able to ask questions and get back the help that you needed.</p><p>And let&#8217;s say that you do extend it out, where you actually define what those metrics are in natural language, so it has a richer context. And then you&#8217;ve got three other people in your team who need to do the same thing. Like, right now, what would you do? Where would you store it?</p><p>Where would you surface it? What interface would you use to create it, to share it? Like, how would that work for you right now?</p><p><strong>Mayowa</strong>: Now, the way it works, this is just for my personal use. I&#8217;ve not had any reason to share it with anybody. But I think that also points to what I was talking about when it comes to conceptualizing what this data system should look like. I think there&#8217;s gonna be a need for us to have part of the data system, maybe the database, a part of it that stores these, I don&#8217;t know whether it&#8217;s gonna be a meta store or whatever, that stores this information. And the reason being that regardless of who is running this query or who is submitting a prompt, they have access to the same information.</p><p>And that information can help provide more grounding to the LLM or the agent or the MCP, whichever one you&#8217;re thinking of.</p>
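<p>One minimal, hypothetical way to picture that shared store: the metric definitions and the example queries live in ordinary tables, right next to the data, and every prompt gets assembled from the same rows no matter who is asking. The table layout, the metric definitions and the prompt wording below are made up for illustration:</p><pre><code># Hypothetical "context store": metric definitions and example SQL kept in the
# same database as the data, and pulled into every prompt for grounding.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE metric_definitions (name TEXT, definition TEXT, example_sql TEXT);
INSERT INTO metric_definitions VALUES
 ('TPV', 'Total payment volume: sum of captured payment amounts in the period.',
  'SELECT SUM(amount) FROM payments WHERE status = ''captured'';'),
 ('TPC', 'Total payment count: number of captured payments in the period.',
  'SELECT COUNT(*) FROM payments WHERE status = ''captured'';');
""")


def grounding_context() -> str:
    """Everyone gets the same grounding rows, regardless of who asks."""
    rows = con.execute(
        "SELECT name, definition, example_sql FROM metric_definitions"
    ).fetchall()
    lines = []
    for name, definition, example_sql in rows:
        # Both the natural-language definition and the structured SQL go in.
        lines.append(f"-- {name}: {definition}\n{example_sql}")
    return "\n\n".join(lines)


def build_prompt(question: str) -> str:
    return (
        "Use only these metric definitions and example queries as grounding.\n\n"
        + grounding_context()
        + "\n\nQuestion: " + question
    )


print(build_prompt("Give me TPV by month for last year"))
</code></pre><p>Whether that store ends up being the same database as the data, a vector store, or something else entirely is still an open question.</p>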
<p>I think that part of what we need to think about is a part of the data system, maybe the database, that can do this. Maybe this opens a new opportunity for database research, to see whether part of it can store information that provides more grounding to the LLM.</p><p>So that is the way I&#8217;m gonna think about it, but right now I&#8217;ve not had any reason to actually explore it. </p><p><strong>Shane</strong>: It&#8217;s an interesting one, because if we think about the fact that we&#8217;ve always stored our operational data, our OLTP data, separately from our analytical data,</p><p>our OLAP data, that&#8217;s because in the past we never really had a technology that allowed both to be stored efficiently in the same place and queried. They were two different query paths, two different storage patterns, and that&#8217;s not true anymore. We&#8217;ve got tools out there like SingleStore that say they do that. I don&#8217;t work with &#8216;em, so I dunno if it&#8217;s true, but they say they do, yet nobody&#8217;s really adopting it that I can see. We still keep it separate.</p><p>And so when we start bringing in this idea of a context store, vector databases seem to be the thing, but actually a lot of the time you don&#8217;t have to use those. You can use anything. So I&#8217;m intrigued, like you, to see whether we end up with yet another database, or we now end up with databases that store the data and the context side by side. It is an intriguing place to be. And then do we end up with a thing that I call the context plane, which is the idea of a shared centralized layer of context? Or do we end up with a context grid, the idea that context is stored next to the data and then something else federates it, provides a grid-type architecture? And so again, we&#8217;re at the early days where lots of people are exploring and saying what works and what doesn&#8217;t. So just on that, just to close it out, this is something you&#8217;ve been thinking about in your spare time, right? You&#8217;re not part of a software company building this. In your organization,</p><p>you&#8217;re not part of a team building it out for them. This is just natural, inquisitive nature going, yeah, that&#8217;s cool.</p><p><strong>Mayowa</strong>: Yeah. Pure, natural, inquisitive. I&#8217;ve read a lot of papers. There&#8217;s this interesting journal paper that came out of Berkeley. I can&#8217;t remember what the title is now, but it&#8217;s very interesting. They did a fantastic job around this conversation too.</p><p>So yeah, for me, I&#8217;m not part of any team actively developing something in that space, but it is just something that I&#8217;m interested in. </p><p><strong>Shane</strong>: And the next step is you&#8217;re gonna start writing. You&#8217;re gonna start writing your thoughts, start writing your explorations.</p><p><strong>Mayowa</strong>: That&#8217;s part of what I&#8217;m doing. By the time you release this, I want to listen to this conversation again, because some of the things that you said I think are really interesting. I want to look into them more deeply. So yeah, I&#8217;m just documenting my thoughts right now.</p><p><strong>Shane</strong>: And my advice to people is document lightly and document early. So share lightly, share early. Don&#8217;t wait for this podcast to come out, document what you think right now and then write it up, share it, and then think about it a bit more and then write it up and share it, 
because people often find the journey of the way you think as interesting as the answer. And what I say is, by writing, it&#8217;s forcing you to think with clarity. You think you know something, you&#8217;re like, oh yeah, I think I know how that&#8217;s gonna work. And then you write it down and you&#8217;re like, yeah, no, that&#8217;s bollocks, that&#8217;s not gonna work.</p><p>So that art of writing just makes you think. Writing is structured, it has a beginning, middle and end. It&#8217;s gonna force you to storytell to yourself and validate what you think is gonna happen. So yeah, my recommendation to everybody is, write small bits.</p><p>Push it out early, it helps you think. It&#8217;ll change, and that&#8217;s okay. It&#8217;s not a book, it&#8217;s not something that you can&#8217;t change. It&#8217;s, hey, I thought this and now I think this, and I think that&#8217;s better than what I thought before. But I had to jump from A to B to C to D to get to E.</p><p>Alright, when this does come out, how do people find what you are writing? Have you worked out where you&#8217;re gonna publish it?</p><p><strong>Mayowa</strong>: If I&#8217;m gonna post, it&#8217;s gonna be on my LinkedIn. If I&#8217;m gonna leverage any other platform, I&#8217;ll also put it on my LinkedIn. So by the time I&#8217;m done with this, I think everybody can find it on my LinkedIn. </p><p><strong>Shane</strong>: Most of us are now writing on Substack, &#8216;cause LinkedIn sucks for long form content. So I&#8217;ll encourage you to create a Substack, yeah. And then tell everybody on LinkedIn that&#8217;s where the long form content is, &#8216;cause LinkedIn&#8217;s kind of killed the ability, which is really sad, &#8216;cause actually I&#8217;d rather just write in one place.</p><p>Excellent. People can see how you&#8217;re exploring and what you&#8217;re learning and what you&#8217;re sharing. I look forward to it.</p><p><strong>Mayowa</strong>: Yeah. 
Thank you.</p><p><strong>Shane</strong>: I hope everybody has a simply magical day.</p><h2>&#171;oo&#187;</h2><div class="pullquote"><p><em>Stakeholder - &#8220;Thats not what I wanted!&#8221; <br>Data Team - &#8220;But thats what you asked for!&#8221;</em></p></div><p>Struggling to gather data requirements and constantly hearing the conversation above?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Bu2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" width="387" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:387,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19725,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/160520537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Want to learn how to capture data and information requirements in a repeatable way so stakeholders love them and data teams can build from them, by using the Information Product Canvas.</p><p>Have I got the book for you!</p><p>Start your journey to a new Agile Data Way of Working.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://adiwow.com/168&quot;,&quot;text&quot;:&quot;Buy the Agile Data Guide now!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://adiwow.com/168"><span>Buy the Agile Data Guide now!</span></a></p><h2>&#171;oo&#187;</h2>]]></content:encoded></item><item><title><![CDATA[Building Data Services with AI with Jason Taylor]]></title><description><![CDATA[AgileData Podcast #79]]></description><link>https://agiledata.info/p/building-data-services-with-ai-with</link><guid isPermaLink="false">https://agiledata.info/p/building-data-services-with-ai-with</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Sat, 10 Jan 2026 08:27:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1482968f-f2c6-49b1-ba24-8a4cfdd4e5d4_800x800.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Jason Taylor a former quant researcher who turned towards the light (or dark) side of data, to explore the practicalities and pitfalls of building data services using AI</p><blockquote><p><strong><a href="https://agiledata.substack.com/i/183787179/listen">Listen</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/183787179/google-notebooklm-mindmap">View MindMap</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/183787179/google-notebooklm-briefing">Read AI Summary</a></strong></p><p><strong><a 
href="https://agiledata.substack.com/i/183787179/transcript">Read Transcript</a></strong></p></blockquote><p></p><h2>Listen</h2><p>Listen on all good podcast hosts or over at:</p><p><a href="https://podcast.agiledata.io/e/building-data-services-with-ai-with-jason-taylor-episode-79/">https://podcast.agiledata.io/e/building-data-services-with-ai-with-jason-taylor-episode-79/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://podcast.agiledata.io/e/building-data-services-with-ai-with-jason-taylor-episode-79/&quot;,&quot;text&quot;:&quot;Listen to the Podcast Episode on Podbean&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://podcast.agiledata.io/e/building-data-services-with-ai-with-jason-taylor-episode-79/"><span>Listen to the Podcast Episode on Podbean</span></a></p><p></p><p></p><blockquote><p><strong>Subscribe:</strong> <a href="https://podcasts.apple.com/nz/podcast/agiledata/id1456820781">Apple Podcast</a> | <a href="https://open.spotify.com/show/4wiQWj055HchKMxmYSKRIj">Spotify</a> | <a href="https://www.google.com/podcasts?feed=aHR0cHM6Ly9wb2RjYXN0LmFnaWxlZGF0YS5pby9mZWVkLnhtbA%3D%3D">Google Podcast </a>| <a href="https://music.amazon.com/podcasts/add0fc3f-ee5c-4227-bd28-35144d1bd9a6">Amazon Audible</a> | <a href="https://tunein.com/podcasts/Technology-Podcasts/AgileBI-p1214546/">TuneIn</a> | <a href="https://iheart.com/podcast/96630976">iHeartRadio</a> | <a href="https://player.fm/series/3347067">PlayerFM</a> | <a href="https://www.listennotes.com/podcasts/agiledata-agiledata-8ADKjli_fGx/">Listen Notes</a> | <a href="https://www.podchaser.com/podcasts/agiledata-822089">Podchaser</a> | <a href="https://www.deezer.com/en/show/5294327">Deezer</a> | <a href="https://podcastaddict.com/podcast/agiledata/4554760">Podcast Addict</a> |</p></blockquote><div id="youtube2-NdE5zcDbICo" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;NdE5zcDbICo&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/NdE5zcDbICo?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>You can get in touch with Jason via <a href="https://www.linkedin.com/in/jasonbennetttaylor/">LinkedIn</a></p><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? 
The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><h2>Google NotebookLM Mindmap </h2><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q8Eg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff724a3fa-cd60-4392-8192-a305bb5e28fd_3944x6939.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q8Eg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff724a3fa-cd60-4392-8192-a305bb5e28fd_3944x6939.png 424w, https://substackcdn.com/image/fetch/$s_!Q8Eg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff724a3fa-cd60-4392-8192-a305bb5e28fd_3944x6939.png 848w, https://substackcdn.com/image/fetch/$s_!Q8Eg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff724a3fa-cd60-4392-8192-a305bb5e28fd_3944x6939.png 1272w, https://substackcdn.com/image/fetch/$s_!Q8Eg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff724a3fa-cd60-4392-8192-a305bb5e28fd_3944x6939.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q8Eg!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff724a3fa-cd60-4392-8192-a305bb5e28fd_3944x6939.png" width="1200" height="2111.5384615384614" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f724a3fa-cd60-4392-8192-a305bb5e28fd_3944x6939.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:2562,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:1395075,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/183787179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff724a3fa-cd60-4392-8192-a305bb5e28fd_3944x6939.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q8Eg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff724a3fa-cd60-4392-8192-a305bb5e28fd_3944x6939.png 424w, https://substackcdn.com/image/fetch/$s_!Q8Eg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff724a3fa-cd60-4392-8192-a305bb5e28fd_3944x6939.png 848w, 
https://substackcdn.com/image/fetch/$s_!Q8Eg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff724a3fa-cd60-4392-8192-a305bb5e28fd_3944x6939.png 1272w, https://substackcdn.com/image/fetch/$s_!Q8Eg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff724a3fa-cd60-4392-8192-a305bb5e28fd_3944x6939.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p></p><p></p><h2>Google NoteBookLM Briefing</h2><h2><strong>Executive Summary</strong></h2><p>This document synthesizes a discussion on the intersection of Artificial Intelligence and data services, drawing from a conversation between  Shane Gibson and Jason Taylor (JT). The core thesis is that while AI, particularly Large Language Models (LLMs), has dramatically lowered the barrier to entry for building sophisticated data services&#8212;especially in parsing unstructured data&#8212;it simultaneously demands a renewed focus on rigorous process engineering, human oversight, and robust evaluation.</p><p>Key takeaways include the shift in the data profession from role-based identities to a more fluid, skills-based approach, where transferable skills are paramount. The conversation categorizes data services into three primary use cases: internal business health, customer-facing data access, and external monetization, with AI impacting all three. A central argument is that unstructured data parsing is now a &#8220;largely solved problem,&#8221; thanks to models like Gemini that can interpret complex documents and even images with remarkable accuracy.</p><p>However, this technological advancement introduces significant risks. The concept of &#8220;blast radius&#8221;&#8212;the potential negative impact of an error&#8212;is critical in determining the appropriate level of AI automation, from human-in-the-loop &#8220;assisted AI&#8221; to fully autonomous systems. The speakers warn against &#8220;vibe coding&#8221; and the tendency to treat AI as infallible magic, citing high-profile failures (e.g., Deloitte, a lawyer using ChatGPT) as cautionary tales. 
The &#8220;maker-checker&#8221; paradigm is presented as a crucial process framework for ensuring quality and accountability. The discussion concludes that data professionals must apply their foundational principles of logging, testing, and healthy paranoia to the AI domain, continuously evaluating models and cross-validating outputs to build trust in these non-deterministic systems.</p><p>--------------------------------------------------------------------------------</p><p><strong>1. The Evolving Data Career: From Roles to Skills</strong></p><p>The dialogue begins by examining the career trajectory within the data field, highlighting a fundamental shift away from rigid job titles toward a focus on underlying skills and attributes.</p><p><strong>1.1. Breadth vs. Depth and The PhD Dilemma</strong></p><p>The transition from a specialized quantitative (&#8221;quant&#8221;) researcher to a broader data professional serves as a key example. This move is framed as a strategic choice between depth (e.g., heavy statistics, requiring a PhD to compete at the highest levels) and breadth (a wider data skillset).</p><ul><li><p><strong>Market Dynamics:</strong> The market often favors broader skillsets, enabling professionals to handle more of a project&#8217;s lifecycle end-to-end. As Shane notes, &#8220;...it&#8217;s easier if you&#8217;ve got a broad set of skills to be able to pick up a gig or do a role.&#8221;</p></li><li><p><strong>The &#8220;PhD Barrier&#8221;:</strong> In highly specialized fields like quantitative finance, a PhD can be a de facto requirement. JT comments on this pragmatically: &#8220;I don&#8217;t have a PhD and competing against PhDs sucks.&#8221; This has historical roots in the 1980s and 90s when finance began recruiting physics PhDs for their expertise in signal processing, which was analogous to financial market analysis.</p></li><li><p><strong>Stereotypes vs. Reality:</strong> While the market may have stereotypes about needing a PhD for certain roles, the speakers question the universal necessity, pointing out that &#8220;not all PhDs are the same.&#8221;</p></li></ul><p><strong>1.2. Attribute-Based Career Planning</strong></p><p>A core argument is that professionals should focus on their inherent attributes and preferred activities rather than chasing job titles, which can be defined differently across organizations.</p><ul><li><p><strong>Focus on Skills, Not Roles:</strong> JT strongly advocates for this approach: &#8220;I hate that the role-based mentality... is for somehow perpetuated.&#8221; This is reinforced by Shane&#8217;s example of survival analysis skills from genetics being applied to supermarket product placement.</p></li><li><p><strong>Data Persona Templates:</strong> Shane is developing a book on &#8220;data persona templates,&#8221; a skills-based framework. By analyzing job ads with a custom GPT agent, he has found that despite varied job descriptions, the underlying skill requirements often distill down to just three core personas.</p></li></ul><p><strong>Key Quote:</strong> <em>&#8220;data scientists are just quants, or quants are just data scientists with more subject matter expertise. Like it&#8217;s all kind of the same thing.&#8221;</em> - Jason Taylor</p><p><strong>2. 
Defining and Monetizing AI-Powered Data Services</strong></p><p>The conversation defines a &#8220;data service&#8221; primarily as a data-centric offering that generates revenue, distinguishing it from internal data teams that support a non-data primary business (e.g., selling ice cream).</p><p><strong>2.1. A Taxonomy of Data Use</strong></p><p>Shane proposes a three-category framework for the use of data in an organization:</p><p>1. <strong>Internal Use:</strong> Understanding and growing the business.</p><p>2. <strong>Customer Support:</strong> Enabling customers to access their own data (e.g., in a SaaS platform or bank).</p><p>3. <strong>External Monetization:</strong> Exposing data externally to generate revenue, which can include direct data sales or enabling partners.</p><p>The focus of &#8220;data services with AI&#8221; is primarily on the second and third categories, particularly where data is enriched or processed for monetization.</p><p><strong>2.2. Models of Data Services</strong></p><p>JT outlines several models for companies that sell data:</p><ul><li><p><strong>Pure Enrichment:</strong> A customer sends their data, the service does &#8220;something fancy&#8221; to it, and sends it back. The process is monetized.</p></li><li><p><strong>Raw Material Sales:</strong> Collecting and selling data, often via methods like web scraping.</p></li><li><p><strong>Integration:</strong> Providing specialty knowledge on how to integrate and organize disparate datasets.</p></li></ul><p>Companies like Bloomberg are cited as examples that successfully combine all three models.</p><p><strong>2.3. The Impact of Generative AI</strong></p><p>Generative AI introduces a new dynamic: non-determinism. Unlike traditional services that sell a predetermined, consistent product, AI-based services sell something that &#8220;may be variable at times.&#8221; This fundamentally changes the nature of the product and the processes required to ensure its quality.</p><p><strong>3. Unstructured Data Processing: A &#8220;Solved Problem&#8221;</strong></p><p>A significant portion of the discussion centers on the claim that AI has made the parsing of unstructured and semi-structured data a &#8220;solved problem.&#8221;</p><p><strong>Key Quote:</strong> <em>&#8220;the one that&#8217;s exploded the most by a massive amount has been unstructured or structured data parsing... I feel like that&#8217;s a solved problem. Now do, maybe that&#8217;s extreme.&#8221;</em> - Jason Taylor</p><p><strong>3.1. From Tesseract to Gemini</strong></p><p>The progress in this area has been substantial. In the past, extracting text from a PDF with tools like Tesseract was challenging, and even training specialized models like Google&#8217;s Doc AI yielded good but not &#8220;that good&#8221; results.</p><p>Now, modern models like Gemini Pro can process complex documents&#8212;including financial statements, org charts, and diagrams within PDFs&#8212;with &#8220;remarkable accuracy.&#8221; JT notes his surprise when he drops a document in and says, &#8220;give me everything,&#8221; and the model understands the content and structure exceptionally well. This has massively lowered the operational barrier to accessing this data.</p><p><strong>3.2. The Diminishing Moat of Domain Expertise</strong></p><p>Historically, the competitive advantage (or &#8220;moat&#8221;) for data service companies like Bloomberg or LexisNexis wasn&#8217;t just providing the raw data (which is often public), but the &#8220;many years of highly skilled and trained... 
professionals augmenting that raw data... with context.&#8221; This organization and curation is what created value.</p><p>LLMs are now diminishing this moat. They have &#8220;come a long way&#8221; and can infer much of the context that previously required thousands of human experts. While there is still value in tribal knowledge&#8212;&#8221;not everything we know is written down&#8221;&#8212;the gap has narrowed significantly.</p><p><strong>3.3. The Power of Visual Interpretation</strong></p><p>A key advancement is the ability of LLMs to interpret documents visually, not just as raw text.</p><ul><li><p><strong>Image-Based RAG:</strong> Processing image data (e.g., a screenshot of a report) instead of just the text can be &#8220;wildly more beneficial&#8221; because the model picks up on subconscious cues like layout, organization, and what else is on the page.</p></li><li><p><strong>Use Case: Report vs. Dashboard:</strong> Shane describes a project where an LLM successfully categorized 8,000 legacy reports by analyzing screenshots. The model learned the human-like heuristic: &#8220;If I see a single table of data, it&#8217;s a report. If I see multiple Widgety objects, it&#8217;s a dashboard.&#8221;</p></li></ul><p><strong>4. The Imperative of Human Oversight and Process Engineering</strong></p><p>Despite the power of AI, the speakers stress that its non-deterministic and fallible nature makes human oversight and robust processes more critical than ever.</p><p><strong>4.1. Blast Radius and Appropriate Automation</strong></p><p>The concept of <strong>&#8220;blast radius&#8221;</strong> dictates the level of risk and, therefore, the necessary level of human involvement.</p><ul><li><p><strong>Low Blast Radius:</strong> A mistake in a marketing campaign might result in spam.</p></li><li><p><strong>High Blast Radius:</strong> A mistake in pharmaceutical trial data could lead to a death.</p></li></ul><p>This leads to a hierarchy of AI implementation:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1yP3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3a3aa1b-0047-4540-8ca3-8cf86b6a6e97_843x157.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1yP3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3a3aa1b-0047-4540-8ca3-8cf86b6a6e97_843x157.png 424w, https://substackcdn.com/image/fetch/$s_!1yP3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3a3aa1b-0047-4540-8ca3-8cf86b6a6e97_843x157.png 848w, https://substackcdn.com/image/fetch/$s_!1yP3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3a3aa1b-0047-4540-8ca3-8cf86b6a6e97_843x157.png 1272w, https://substackcdn.com/image/fetch/$s_!1yP3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3a3aa1b-0047-4540-8ca3-8cf86b6a6e97_843x157.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1yP3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3a3aa1b-0047-4540-8ca3-8cf86b6a6e97_843x157.png" 
width="843" height="157" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3a3aa1b-0047-4540-8ca3-8cf86b6a6e97_843x157.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:157,&quot;width&quot;:843,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:27112,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/183787179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3a3aa1b-0047-4540-8ca3-8cf86b6a6e97_843x157.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1yP3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3a3aa1b-0047-4540-8ca3-8cf86b6a6e97_843x157.png 424w, https://substackcdn.com/image/fetch/$s_!1yP3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3a3aa1b-0047-4540-8ca3-8cf86b6a6e97_843x157.png 848w, https://substackcdn.com/image/fetch/$s_!1yP3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3a3aa1b-0047-4540-8ca3-8cf86b6a6e97_843x157.png 1272w, https://substackcdn.com/image/fetch/$s_!1yP3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3a3aa1b-0047-4540-8ca3-8cf86b6a6e97_843x157.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>4.2. Failures of Blind Trust: &#8220;Vibe Coding&#8221;</strong></p><p>The discussion warns against the dangerous trend of &#8220;vibe coding&#8221;&#8212;uncritically accepting and deploying AI-generated output. High-profile failures serve as cautionary tales:</p><ul><li><p>A lawyer who used ChatGPT for a legal filing, which included fabricated case citations.</p></li><li><p>Deloitte being forced to repay a government agency half a million dollars after using AI to generate a report that &#8220;hallucinated a whole lot of case studies.&#8221;</p></li></ul><p><strong>Key Quote:</strong> <em>&#8220;If you hired a genius level person... Would you read their work after they generated it...? I don&#8217;t give a fuck how smart you are. I&#8217;m reading what you put together... I&#8217;m accountable. So why in any of these circumstances would you not check this stuff?&#8221;</em> - Jason Taylor</p><p><strong>4.3. The &#8220;Maker-Checker&#8221; Paradigm</strong></p><p>The solution to managing AI&#8217;s fallibility lies in process engineering. The <strong>&#8220;maker-checker&#8221; paradigm</strong>, a common process in manufacturing and finance, is proposed as an essential model for AI workflows. One agent (human or machine) creates the output (the &#8220;maker&#8221;), and a separate agent reviews and validates it (the &#8220;checker&#8221;). This builds in accountability and a review system, much like code reviews (PRs) in software engineering.</p><p><strong>5. 
Evaluation, Testing, and Trust in Non-Deterministic Systems</strong></p><p>The conversation highlights a cognitive dissonance where seasoned data professionals often forget their core principles of testing and validation when working with AI.</p><p><strong>5.1. The Underinvestment in &#8220;Evals&#8221;</strong></p><p>&#8220;Evals&#8221; (evaluations) are the AI equivalent of software testing. This is seen as a &#8220;massively important&#8221; but &#8220;under invested area.&#8221;</p><ul><li><p><strong>Complexity of Testing AI:</strong> Testing an AI system is more complex than traditional code because there are more moving parts that can change: the underlying LLM model (which vendors can update), the prompt, the RAG context documents, and subtle variations in the input data.</p></li><li><p><strong>Methods for Evaluation:</strong></p><ul><li><p> <strong>LLM as a Judge:</strong> Using one LLM (e.g., Claude) to evaluate the output of another (e.g., Gemini).</p></li><li><p><strong>Testing at Scale:</strong> Running a large number of tests, including edge cases and &#8220;chaos engineering&#8221; style random inputs, to understand the model&#8217;s boundaries.</p></li><li><p><strong>Ad Hoc Testing:</strong> Even simple measures like asking the same question multiple times to check for consistency in the answers is &#8220;better than nothing.&#8221;</p></li></ul></li></ul><p><strong>5.2. Logging and Healthy Paranoia</strong></p><p>Data professionals are trained to &#8220;log the shit out of everything,&#8221; yet often fail to apply this discipline to AI systems. Logging the reasoning path of an LLM is crucial for debugging and understanding its behavior, especially when an unexpected answer is produced.</p><p>A <strong>&#8220;healthy degree of paranoia&#8221;</strong> is described as a beneficial trait for data professionals. This involves an inherent distrust of outputs and a commitment to cross-validation. JT states, &#8220;I still crosscheck things. When I write code with LLMs, I read all of it, like all of it, I see my role as I am the reviewer.&#8221;</p><p><strong>6. 
The AI Toolkit and Professional Practices</strong></p><p>The speakers discuss their personal toolkits and workflows, revealing practical strategies for leveraging AI effectively.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8UfS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb721d8e1-7619-41f8-9b15-583fdfd7db61_817x218.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8UfS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb721d8e1-7619-41f8-9b15-583fdfd7db61_817x218.png 424w, https://substackcdn.com/image/fetch/$s_!8UfS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb721d8e1-7619-41f8-9b15-583fdfd7db61_817x218.png 848w, https://substackcdn.com/image/fetch/$s_!8UfS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb721d8e1-7619-41f8-9b15-583fdfd7db61_817x218.png 1272w, https://substackcdn.com/image/fetch/$s_!8UfS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb721d8e1-7619-41f8-9b15-583fdfd7db61_817x218.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8UfS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb721d8e1-7619-41f8-9b15-583fdfd7db61_817x218.png" width="817" height="218" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b721d8e1-7619-41f8-9b15-583fdfd7db61_817x218.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:218,&quot;width&quot;:817,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31147,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/183787179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb721d8e1-7619-41f8-9b15-583fdfd7db61_817x218.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8UfS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb721d8e1-7619-41f8-9b15-583fdfd7db61_817x218.png 424w, https://substackcdn.com/image/fetch/$s_!8UfS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb721d8e1-7619-41f8-9b15-583fdfd7db61_817x218.png 848w, https://substackcdn.com/image/fetch/$s_!8UfS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb721d8e1-7619-41f8-9b15-583fdfd7db61_817x218.png 1272w, https://substackcdn.com/image/fetch/$s_!8UfS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb721d8e1-7619-41f8-9b15-583fdfd7db61_817x218.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a></figure></div><p>A key professional practice is to <strong>always review AI-generated code</strong>. A common red flag is when the code references an outdated model version (e.g., Gemini 1.5 when 2.5 Pro is current), which is a &#8220;dead giveaway&#8221; that the user did not read the code.</p><p><strong>7. Future Outlook: Tribal Knowledge and Creativity</strong></p><p>The dialogue concludes by speculating on the future of knowledge and human creativity in an AI-dominated landscape.</p><ul><li><p><strong>Gravitation to the Mean:</strong> LLMs, by their nature, gravitate toward the mean or average of their training data. This could create a problematic feedback loop where human thought becomes less diverse.</p></li><li><p><strong>The Pendulum Swing to the Arts:</strong> As AI automates more rote, scientific, and predictable tasks (&#8221;sciences&#8221;), there will be a &#8220;pendulum swing&#8221; where society places a higher value on uniquely human traits like creativity, randomness, and artistic expression (&#8221;the arts&#8221;). JT states, &#8220;I am very bullish on the arts.&#8221;</p></li><li><p><strong>The Future of Tribal Knowledge:</strong> While some may try to hoard their proprietary knowledge behind paywalls, the speakers hope that technology will continue to lower the barrier to recording and sharing information. This could accelerate the advancement of human knowledge, as more ideas are documented and built upon. The belief is that we have not yet reached a point where &#8220;all thought has been explored.&#8221;</p></li></ul><p></p><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><p></p><h2>Transcript</h2><p><strong>Shane</strong>: Welcome to the Agile Data Podcast. I&#8217;m Shane Gibson.</p><p><strong>JT</strong>: I am Jason Taylor, or you can call me jt. Either one&#8217;s fine.</p><p><strong>Shane</strong>: Hey, jt. Thanks for coming on the show. I think today we&#8217;re gonna have a bit of a chat around building data services with ai. But before we do that, why don&#8217;t you give the audience a bit of background about yourself</p><p><strong>JT</strong>: Yeah, sure. I guess the simplest way that I usually explain it is beginning of my career was more quant research and data, and then I just gradually went towards the data. I don&#8217;t know if that&#8217;s towards the light or towards the dark, but um, let&#8217;s see. I worked facts at Palantir. Usually everybody wants to talk about buy side, whole bunch of different places.</p><p>Now I play in startup land because I have a death wish. I don&#8217;t know. Yeah working on a bunch of fun stuff, </p><p><strong>Shane</strong>: what made you come from Quant To pure data.</p><p><strong>JT</strong>: That&#8217;s a good question. And, and quant is a data gig in a lot of ways, right? And Joe Reese started in this as well, doing more quant research at one point in his career. 
Like, I think there are different types of quants, but I think a good majority of them, as people have become more technical over time, have become more data oriented.</p><p>Almost like, sometimes I joke that data scientists are just quants, or quants are just data scientists with more subject matter expertise. Like it&#8217;s all kind of the same thing. But why? I think the very practical answer is I don&#8217;t have a PhD and competing against PhDs sucks. </p><p><strong>Shane</strong>: I think it&#8217;s interesting. It&#8217;s this idea of breadth versus depth for me. I see people start out with heavy statistics, right? That&#8217;s what they love and that&#8217;s what they get into, and that&#8217;s a very specific set of skills. And then often they&#8217;ll go into breadth. They&#8217;ll extend their skillset out and become more data and a little bit less stats-heavy.</p><p>And my perception is, because that&#8217;s where the market is, it&#8217;s easier if you&#8217;ve got a broad set of skills to be able to pick up a gig or do a role and do more of the work end to end yourself than if you are a specialist with a really deep set of skills. I hadn&#8217;t quite thought about the PhD side. I always talk about, are you gonna be in the top 5% of your skillset? And I suppose in that area, to do that, you have to be a PhD.</p><p><strong>JT</strong>: So first of all, let me start by saying I very much subscribe to the thing you said, where I want to capitalize on my strengths and there are certain things I know I&#8217;m good at. And actually Google came out with this cool tool where you can actually talk to it about a job, and it&#8217;s all AI oriented, but it will actually tell you attributes and things. I think you self-describe your attributes and it helps pinpoint potential jobs that it believes fit you. And it&#8217;s actually really smart, and I very much subscribe to this attribute-based or characteristic-based job placement, if you will, as opposed to, you know, there&#8217;s a ton of people that are like, oh, I wanna be a data engineer or a data scientist.</p><p>And it&#8217;s great. Those roles are wildly different at different places. What things do you like to do? What attributes? So, long-winded way of saying I very much subscribe to this. I think that there are certain things that I gravitated towards in terms of the attributes I like and the activities I like, like I like being analytical, sometimes to a fault, which I think we all do in the data space to some degree. But then I also think the PhD thing is maybe a little bit of terrible history. But yeah, at some point in time, from a finance trading perspective, people started leaning into math, go figure. And there&#8217;s portions of finance that have always been in that space.</p><p>But at one point, I think it was in like the eighties, yeah, more the eighties and maybe early nineties, was when they started pulling in like physics professors and the like, because people started looking across the aisle effectively and saying, hey, you&#8217;re doing signal processing and studying these things, and hey, that&#8217;s the same as finance.</p><p>Like it&#8217;s the same thing. So people started looking at that type of math and that type of process that they were doing. And that&#8217;s, I think, when people started leaning into that, right? 
There&#8217;s a piece of this which is, I feel, and I&#8217;m sure this will relate to AI really fast, just stereotypes, right?</p><p>Like people are like, oh, I need a PhD to do this. And I&#8217;m like, do you really? Do you know what you&#8217;re doing? Do you think all PhDs are the same? But ultimately that&#8217;s a good chunk of the market. I&#8217;m not gonna discredit people having a PhD. I think that&#8217;s awesome. I definitely feel like I would&#8217;ve wanted to do that at some point in my life had I not made certain decisions.</p><p>But yeah, it&#8217;s hard. Those people are smart, and especially on paper, from a recruiting perspective, you know how it is, some recruiters are just like, oh, here are the qualifications I was given, yay or nay. And that&#8217;s like&#8230;</p><p><strong>Shane</strong>: I think the recruitment process is broken. I think that whole industry is gonna be disrupted. I think we&#8217;re gonna move to more community or closed-network based recruitment. Very much. </p><p><strong>JT</strong>: Aren&#8217;t we there though? Aren&#8217;t we already there? </p><p><strong>Shane</strong>: Yes, we are for some, but yeah. If I look at the way the government over here recruits, my standard joke is somebody in a government agency generates a job description with AI, to which the candidate uses AI to generate the CV to match the job ad, to which the recruitment agent then uses AI to see whether the CV matches the job ad. I was talking to somebody and they were advertising for an administrator, an office kind of administrator, as part of their process. And they were saying they had five perfect CVs, perfect matches, all the experience, all the skills. And when they went to interview them, none of them had done anything near what the CV said they had.</p><p>So I think this personal recommendation, this personal network, is gonna become more and more important. And the idea of skills being transportable. I remember many years ago when I was working at SAS and I was watching one of the projects one of the consulting teams did for a supermarket.</p><p>And they ended up using survival analysis to figure out placement of takeaway hot meals with something else. And I didn&#8217;t do well at stats at school. I didn&#8217;t enjoy it. So it was always funny when I went to work for a company that was pure statistics. And I was like, how does this work?</p><p>And they said, imagine you&#8217;ve got a tall man and a short woman, and they have a child. There&#8217;s survival statistics on which genes are gonna survive and whether the child will be tall or short. That&#8217;s what we use to decide whether we should put beer or Coca-Cola next to those meals.</p><p>And I&#8217;m like, I still don&#8217;t understand it, but it makes sense. And so it&#8217;s this idea that actually, like you said with the finance stuff, these skills you get in one domain and then they are really useful in an alternate domain. So focus on the skills, not the role,</p><p>and that&#8217;s where the value is. Yeah. </p><p><strong>JT</strong>: Yeah, no, a hundred percent. I hate that that&#8217;s perpetuated. I hate that the role-based mentality for a lot of people is somehow perpetuated, that&#8217;s just how people think about careers.</p><p><strong>Shane</strong>: I&#8217;m currently perspiring over my second book, which is around data persona templates. And so one of the things I&#8217;m doing is I need examples for the book. 
So what I&#8217;ve started doing is downloading job ads for data professionals. And then I&#8217;ve created basically a ChatGPT agent based on the book, with a whole bunch of prompting.</p><p>And I&#8217;m telling it to take those job ads and create the persona template for me.</p><p>And the persona template is skill-based, it&#8217;s all about skills. It&#8217;s really intriguing to take all these different jobs that look different, and then when you boil them down to what the persona is, it typically comes up with only three,</p><p>three core ones. There&#8217;s variations of them, but I&#8217;m intrigued by that. But this one&#8217;s not about data personas. This one&#8217;s about building data services with AI. Why don&#8217;t you explain what the hell you mean by that? And we&#8217;ll go from there.</p><p><strong>JT</strong>: What do I mean by that? There&#8217;s a very easy rabbit hole here, which is like, what is a data product, which I&#8217;m going to avoid. We are not going to go down that hole.</p><p><strong>Shane</strong>: Ah, come on. Pedantic semantics says 45 minutes of us arguing a definition of a definition.</p><p><strong>JT</strong>: Terrible. No, I&#8217;m not playing that game. What are definitions, Shane? Can we, yeah, define definition. If you ever see Joe Reis, make sure you ask him to define definition. Joe loves debating semantics, if you didn&#8217;t know that already. So if you see him or hear him, please ask him these questions.</p><p>He loves them. Anyway, no. I&#8217;ll start with an interesting divide that I was actually talking to someone earlier about. I always find it interesting because in the vast majority of the data space, usually when you talk to other people in data professions, they&#8217;re usually in some sort of supporting role, I&#8217;m gonna call it. And when I say supporting role, I mean that the data itself is not the revenue generating aspect of the company. Very basic example: I work at a company that sells ice cream, so I help &#8216;em figure out how to best sell ice cream. But still, the main thing that you&#8217;re selling there is ice cream, right? There is an entire world of people like us that sell data. And I&#8217;ve lived in that world for a very long time. It largely focuses around things like finance, because finance is an industry that has to consume data in order to operate. There are other industries, I think marketing and advertising and things like that, that are also in this scope.</p><p>But there&#8217;s an entire industry of people that just sell data. So I think that, especially from a data services with AI perspective, I think of a couple different things. There are the people that are pure enrichment based: send me your data, I will do something fancy and send it back to you. There are the people just selling the raw material: here&#8217;s the data I collected in some capacity. A lot of those people are doing web scraping, not all of them. And then there&#8217;s other people that do integration style work, and if you think about a Bloomberg, which is relatively a household name at this point, they do all of them, which is interesting, right?</p><p>They&#8217;re both consuming from the public domain, they also have specialty knowledge around how to integrate the data, and then also how to enhance it if you give it to them. But now we&#8217;re in a new world. Not to say generative AI is brand new. I think it&#8217;s relatively common knowledge,</p><p>maybe not for everybody, that generative AI&#8217;s been around for a little while. 
It&#8217;s just now very mainstream, very accessible. There&#8217;s 10 cajillion startups around it. But it&#8217;s very interesting because , of course the key. Aspect of it, which is it&#8217;s non-deterministic. So now I&#8217;m not selling a predetermined thing to some degree, I&#8217;m selling something that may be variable at times.</p><p>I think that, especially, services around AI today, there&#8217;s definitely no shortage of web scraping companies. , I think the one that&#8217;s exploded the most by a massive amount has been unstructured or structured data parsing, that&#8217;s exploded. And I feel and I&#8217;m curious of your opinion, and I&#8217;ll pause here. I feel like that&#8217;s a solved problem. Now do, maybe that&#8217;s extreme. Do you agree?</p><p><strong>Shane</strong>: I think it&#8217;s like when people tell me data collections a solved problem. And every time we onboard a new customer, two or three of their source systems ones we know will and then there is always an outlier system where there&#8217;s a hundred customers in the world and the APIs are really badly documented.</p><p>And the data structures are a nightmare for us to try and understand. And I sit there and go solve problem my ass. I just wanna go back though. There&#8217;s there&#8217;s an interesting one. One of, one of the things I do when I kind of work with companies and I&#8217;m helping teams out we start this idea of this playbook.</p><p>And the idea of a playbook is basically a bunch of slides and they have two purposes. The primary purpose is to explain how the data teams work for any new data team member. So when you onboard, you read this and you you get a feeling of how they&#8217;re structured, how their workflows, what the culture of the team is.</p><p>And the second one is if you&#8217;re a stakeholder in the organization, it tells you how the team works so you understand how to engage with them how long you&#8217;re gonna pretty much wait, what your role is and that data work. And one of the slides that becomes really common for me now, right at the beginning, is saying that I can categorize use of data in three ways internal use to understand the health of the business and grow the business data to support the customers where the customer&#8217;s actually accessing their own data.</p><p>So typically as software as a service or a bank or insurance company. And the third one is where data is used externally for monetization. And that might be selling data that might be enabling partners to use whatever you have. If you&#8217;re an insurance company and you&#8217;ve got insurance agents out there that I treat that as external data, you&#8217;re giving them access to that data outside your organization based on your customers to make more money.</p><p>So I kind of like that framing, so what we are talking about when you talk about data services with ai, you are talking about that last one, data being exposed externally, securely to make money somehow. And it may not be selling data, but it&#8217;s definitely we are exposing that data and sharing it to monetize it.</p><p>Is that what you&#8217;re talking about?</p><p><strong>JT</strong>: I think that is definitely correct. I also think it&#8217;s the middle one too which is like exposing their own data, if I heard that correctly.</p><p>The exposing their own data, I think is another one. Like a very common is just enrichment, it&#8217;s already my data. 
You may be adding or doing something to it, and then I’m monetizing the process as opposed to the data itself.</p><p><strong>Shane</strong>: Okay. I heard a podcast a while ago that intrigued me, and it was around a massive US company that had digitized and augmented all the lawyer case history, thingies. Can’t remember which one it was. I seem to think LexisNexis, but I’ve probably got that wrong. And what was interesting here, and it’s coming back to this idea of semi-structured or structured, was what they were saying.</p><p><strong>JT</strong>: Too.</p><p><strong>Shane</strong>: Yeah, you define definition. Actually, I have a definition of that. I was doing a group thing with Ramona and Chris Gamble,</p><p>and Ramona and I have a disagreement on the definition of structured and unstructured. And it’s all around CSV, right?</p><p>Anyway, so what’s interesting about this is you think, okay, we’ve got all this case history stuff, and it’s in books, and there are probably digital versions of those books. And so it’s a solved problem. Now I could go and grab all that content and digitize it and create a service that competed with them.</p><p>And yes, they’ve got market share, and how do I find the market in that, potentially. But I get the impression it’s not cheap; it’s for lawyers, and nothing with lawyers is cheap. So I could probably disrupt them and Uber them, or Netflix them, right?</p><p><strong>JT</strong>: Yeah.</p><p><strong>Shane</strong>: But the key was the augmentation.</p><p>It was the many years of highly skilled and trained legal professionals augmenting that raw data, even though it was unstructured, with context, and that context is where their moat was. So before we get onto that idea of, is the problem solved: is that what you are seeing though, that once you get this data and then you add additional context to it, that creates the value, that creates the moat that’s harder for anybody else to actually go and breach?</p><p><strong>JT</strong>: Yeah, I’m just gonna start outright by saying yes, I agree with that. Again I’ll refer to Bloomberg, FactSet, S&amp;P, et cetera, because that’s the world I know. Public company filings are public, right? You can go and you can download Apple or anybody else’s 10-K or 10-Q or whatever other filing. Cool, neat. There is some value there in being able to access that and make it easy to consume and blah, blah, blah. No discount to that. And there are actually cool open source people doing exactly that now, which I’ll come back to. But yes, the work that they do to organize it, right? To your point on semantics and governance and all this stuff, it’s that organization that actually gives value to the data.</p><p>It’s not about just serving it up, it’s about making it usable with other stuff, or potentially integrating it, so on and so forth. So is that all hardcore domain expertise? Some of it, not all of it. I imagine in the legal space it is far more. But the part I would want to come back to is that getting access to the data, and the value provided in that, is just operationally dramatically lower now, massively lower.</p><p>I don’t think anybody disagrees with that. I think that integration, that domain expertise, has also become more accessible. 
I think that even though, in that particular case, and maybe it was LexisNexis or something, they have all these domain experts there, I do also think that LLMs have come a long way. They can infer a lot of these things. It doesn’t mean they’re perfect, but it does mean that we’ve accelerated what usually took hundreds or thousands of individual people to fill those gaps. So the moat has diminished significantly.</p><p>Do I think that there’s still differentiation or IP in domain expertise? I absolutely do. It’s one of the things I’ve been talking to people about, and I’m very comfortable with that idea. There’s a very basic idea, right? Not everything we know is written down, period. It’s just not written down, or it’s a little bit more out there. And at the end of the day, large language models, for the most part, gravitate towards a mean; that’s by definition what they do. Do I think that means they can’t learn these things? Not necessarily, and I’m using learn very loosely in that context.</p><p>But I do think the tribal knowledge, et cetera: this is what people are trying to do with RAG and everything else, right? It’s just, how do I shove my tribal knowledge into the thing so that it has what it needs to do this. But yeah, that is definitely what we’re talking about.</p><p>I think that the expectations, and obviously what you can accomplish today, are wildly different. Especially in terms of what you can do, the bar is tremendously lower. But I think the expectations are wildly off the chart. To the earlier point, unstructured versus structured: there was a period of time when shoving a PDF into pick-your-favorite-model, whatever it was, sucked a lot. Even if the PDF was pretty modern and you could extract the text off it, great, cool, still not great. And we’ve come a substantial distance from that, where I, on a regular basis, am processing PDFs just as part of my day to day. And I really like Gemini, I’ll plug them, I don’t mind. I really like Gemini’s Pro model for the vast majority of things that I plug in there, and bear in mind, these are all kinds of interesting financial statements from a variety of different providers. It could be org charts.</p><p>I’ve found org charts in there, which has been cool, diagrams, all kinds of weird shit. And they, with remarkable accuracy, just pull it off. And my favorite part is, and not that I’m doing this professionally, but the first thing I always try and do, just for absolute giggles, is I drop the document in and I say, give me everything. And then I’m just like, let’s see what happens. Fuck it. Let’s see how well it understands things. And I’m continuously surprised, because it wasn’t that long ago that you’d have Tesseract or one of these other platforms that’s very widely used and widely accepted.</p><p>And even if you trained a model like Doc AI or any of these things, they were good, but they weren’t that good. Like I couldn’t just drop random shit in and be like, hey, gimme the stuff, and then, that’s awesome.</p><p><strong>Shane</strong>: I remember Xero days, when we talked about dropping an invoice or a receipt and having it just turn up in your accounting system with accuracy. And this is 10, 15 years ago from memory. 
Back then it was a hard, complex problem.</p><p>Now it’s not. One of the things I think we’ve gotta be really conscious of is this idea of blast radius.</p><p>So what I used to always say in the data space, ghost of data past, is: I’ll work on a marketing campaign, because the data we’re gonna get is gonna be crap, and therefore the results we’re gonna get are gonna be okay for moving a lever, but they’re not gonna be accurate. But if I was working on pharmaceutical data for a drug trial, then it has to be different, because the blast radius of getting that wrong is somebody dies.</p><p>The blast radius of an incorrect marketing campaign is you spam somebody. And I think this is gonna be the same with using AI against data services. You’ve probably heard about the one where a lawyer used AI for a filing to the judge, yeah.</p><p>And there’s a case in Australia where Deloitte had to pay back half a million dollars to a government agency because they used AI to generate their very expensive review document, and it hallucinated a whole lot of case studies. So I think we’ve gotta be careful about where we use it.</p><p>But I can see that the domain knowledge it has now, from all the data and tribal knowledge it stole, is useful, right? If the blast radius is acceptable, then actually it’s good enough to look at that and apply some context, and it’s cheaper and faster than a human doing it, and the impact of it being non-deterministic and getting it wrong is acceptable.</p><p><strong>JT</strong>: But on that note, and we can tie this easily back to governance, among other things: in the world that I’ve come from, it’s very easy, in a lot of contexts, to generate, and I’ll use the Deloitte example because it’s fun to pick on them, to generate a report and just shove it out the door. Why do you do that? Fuck if I know, but it’s very easy to do that. And this is my current complaint with vibe coding as well, right? People hit the button, they say, ah, it’s fucking magic, and then they shove it out the door. And it’s, for the love of God, do you read your PRs?</p><p>Do you edit your own writing? Please go back and read what came out of the random black box. Places I’ve worked, especially when we were at Palantir, were amazing at shipping things fast, right? But I was with a bunch of people, I won’t say where, I’d have to kill you if I told you where. And they were phenomenal engineers, very good at understanding the problem and writing code very fast. And I caught a couple of times where I’m like, guys, just read each other’s stuff. It doesn’t take, I promise it doesn’t take that long. And then we got there really quickly. It wasn’t a big deal, and they’re, again, all exceptional, so it was a fairly easy thing to do. But I know tons of people that are engineers that don’t read their PRs, or don’t have automated checks in place, or no linting, all of this.</p><p>Do you use Grammarly? If you write, do you spell-check shit? Like, for the love of God. If you’re writing a paper that has citations, fucking check the citations. 
This is like basic stuff, and I think that people are getting so jazzed, thinking it’s literal magic, that they’re just forgetting everything.</p><p>They’re like, what planet am I on? Hit the send button. Go for it. And it’s just, yeah: you saved 90% of the time you would’ve otherwise spent writing it. You can spend another couple of minutes just making sure it didn’t spit out garbage. And I’ll rant on this for two more seconds.</p><p>If it was a person, if you hired a genius-level person, let’s assume that you hired somebody tomorrow to help you with your job who is a certified genius. Would you read their work after they generated it, or would you just say, here, okay, cool, and submit it to your boss or your customer? I don’t give a fuck how smart you are. I’m reading what you put together, right? I need to know what it says. I need to know what I’m representing. I’m accountable. So why, in any of these circumstances, would you not check this stuff? It’s mind-blowing to me.</p><p><strong>Shane</strong>: It’s an interesting one, because as we know, an LLM is non-deterministic. And so people go, how do you trust that it’s doing the right analysis? And Juan Sequeda had a great comment many years ago where he said, how do you trust the human?</p><p>And I sat there and I was thinking, yeah, the number of times I’ve seen a data analyst come up with a number and nobody peer reviewed it, and we trust it because a human wrote some code. And I suppose the code’s deterministic; you can go and see the code and run it time and time again and get the same response, right or wrong as that response is. But that’s not the point. The point was you trusted that number, made a business decision, and nobody peer reviewed the process or the code. And I’ll go back to one thing that is definitely happening with AI vibe coding at the moment, if you wanna see how dangerous it is. We’re a Google platform, so I love Google.</p><p>Actually, I love their platform. I love their technology. I hate their partnering and I hate their marketing, ’cause it’s the worst in the world. But anyway, just go on to Reddit, onto the Google Cloud subreddit, and watch how many students are going and vibe coding with the Gemini API and pushing their code to a public Git repository, and then getting whacked with a three to $30,000 bill within two days, because their API key is publicly available and people are just scanning Git repos, grabbing those keys and slamming them.</p><p>Somebody should have run their eye over that. You just watch the unintended consequences of this democratization. But let’s go back to that image one, ’cause it’s interesting for me. So one of the things we did with one of our customers a while ago: they were moving from a legacy platform to a new data platform.</p><p>And they had, oh, I can’t remember what it was, but something like 8,000 Cognos dashboards and reports, built over 20 years, and</p><p><strong>JT</strong>: People watch every single one though, too.</p><p><strong>Shane</strong>: They’re all active. When they asked which ones could we get rid of, the answer was none. And they’d spent a couple of months with a small team of really good data analysts and BAs trying to document them.</p><p>So all they really wanted to understand was how big’s the data estate, right? 
How many of these do we have? What do they look like? Which ones do we migrate, rebuild, or migrate first? Which ones don’t we? And what we ended up doing is we ended up building a tool called Disco. Effectively, they exported out all the Cognos reports as XML.</p><p>Yep. That was definitely a disco.</p><p><strong>JT</strong>: I’m dancing, for anybody listening.</p><p><strong>Shane</strong>: That’s right. Yeah, I have a habit of making t-shirts, so we have a T-shirt for Disco. People can buy it online if they really want. And so what we did was they basically pushed the XML files to us, and then Disco went and scanned them.</p><p>And we did a whole bunch of prompt engineering based on some patterns to say, what’s the data model underneath them? What’s the information product canvas? What action and outcome do we think’s being taken, right? So we generated all this context and then pushed that back into a database so they could query it.</p><p>And that worked, right? There was some engineering we had to do, because doing it for one XML file manually works; to do it for a thousand repeatedly, you have to loop through. But the blast radius was small, right? Because really what they wanted was insight. And then what happened was they came back and they said, we documented the reports that were copied:</p><p>so this report looked like that report, but it had a new filter, and this report looked like that report, and it had one more column. So where people had just cloned the report, that helped them de-dupe. But then they came back with an interesting question and they said, can you tell us which are reports and which are dashboards?</p><p><strong>JT</strong>: Oh.</p><p><strong>Shane</strong>: So, hmm, okay. And there were some business reasons why they wanted to do that. So what we tested was them uploading screenshots of the reports and dashboards. And what’s interesting is, yeah, Gemini, and I think back then this was pre-2.5 Pro, but even then it was good. It basically did what a human did and said, if I see a single table of data, it’s a report; if I see multiple widgety objects, it’s a dashboard. And it went through and categorized them, and I was like, holy shit, that just makes sense. And the other thing I’ve been doing is uploading the information product canvas as a screenshot.</p><p>So building a canvas with a stakeholder, taking a screenshot of it and putting it into the LLMs, and then saying, give me the metrics, give me the business model, give me a physical model, a whole lot of questions around it. Whereas in the past what I’d do is scrape out the text for that and put the text objects in, now I can just put an image in there and I get almost as good a response.</p><p>Now the key is the blast radius, because what I want is understanding; I want to understand quickly, versus I’m not gonna go tell it to build an information product and deploy it. But yeah, I cut out a whole lot of effort and it feels magical.</p><p><strong>JT</strong>: So I’m gonna do two things. One, yeah, image-based LLM use is awesome. I saw someone recently note how we’ve been talking about RAG, but doing it purely on image data as opposed to on the text itself has been wildly more beneficial. 
And that&#8217;s because, there&#8217;s subconscious cues there, there&#8217;s things that we pick up on when you look at the layout of the text, what else is on the page, how the text is organized.</p><p>And it&#8217;s not just about looking at the text itself. And that&#8217;s been hugely beneficial. So that&#8217;s, whenever I do data file processing today, extraction, structured, unstructured, that kind of stuff, it&#8217;s all visual oriented. I try to avoid scraping entirely. Now, I&#8217;ll give you a kicker, which I don&#8217;t think this is IP at all, but like a kicker is a Excel.</p><p>I can&#8217;t pdf an Excel document that&#8217;s just extra stupid, right? You can&#8217;t do images of Excel. But Excel is a wildly interesting thing. This is where we can get into structured unstructured, right? Excel under the hood&#8217;s, really what XML or that weird format that it uses, right? In all I would define that as semi-structured.</p><p>Other people may fight me on this, that&#8217;s fine. But I would define that as semi-structured. Because it has structure inherently, but it&#8217;s also variable in nature. So I consider that semi. But those documents are hard to understand because, hell, I&#8217;ve seen too many really shitty financial models that are just like 20 tables in one tab and I&#8217;m like, oh, for the love of God, why did you do this? Who are you and what kind of chaotic person? Organize your shit, like gimme a break. This is insane. No. Scrolling around thousands of rows left or right. This is wild. And you&#8217;ve seen these, like people build financial models, just the most ass backwards ways on the planet. But visually interpreting them, assuming that you can get away without the pagination or anything like that is, very good. It&#8217;s way better from a visual standpoint. But I think that the big thing, and I&#8217;ll circle it back to the main topic here, If you think about, now there are companies out there that sell data where, that have a data oriented process, right? And that is their main revenue driver. And then you think about shoving a large language model into that process in some capacity, mostly probably because it&#8217;s unlocking some new features for you, or you&#8217;re moving faster, you don&#8217;t need humans, blah, blah, blah. In some ways It&#8217;s not really different than it was before. And I know that sounds really weird to say, but the reason I&#8217;m saying that is because if this is your product you always had, whether you acknowledged it or not, you always had a need to set up proper process to ensure you have a quality product. So for me, when I think about large language models and their use in any of these processes, it&#8217;s all process engineering. Yes, context engineering, blah, blah, blah, blah. But like it&#8217;s process engineering really that we&#8217;re talking about. And especially. Once you start talking about multiple agents, I&#8217;m air quoting agents because that&#8217;s a different bag of tricks. But once you get, multiple autonomous processes interacting, like it&#8217;s all process architecture, right? You&#8217;re, I think the most common one that a lot of people talk about is a whole maker checker paradigm of you make it, I check it. That&#8217;s how it works forever and always.</p><p>We don&#8217;t cross lines that works reasonably well. There&#8217;s some sort of, accountability structure and review system and so on and so forth. 
PRs have the same thing, and people still put in shitloads of additional automation, but it’s process structure, right? So even if my entire data product, just for hypothetical sake, is me shoving in a prompt and hitting play repeatedly and then sending it to somebody, you should still have some sort of review system. You should still have some sort of checks to make sure it’s not garbage; every major manufacturer, et cetera, has this. And that’s why, with the lawyer and the Deloitte example, I’m sure the PowerPoint Deloitte put out was huge and it had a billion references and it was blessed by the Pope and shit.</p><p>I’m sure it had all this stuff, so that probably made it really hard. But we’re data people, right? Rip all the fucking things off, go cross-validate them with a deterministic process, flag the ones that don’t match. You can bootstrap your probabilistic process with deterministic shit.</p><p>It doesn’t have to be, everything’s tossed to the wind because you’re using a new tool, just hail Mary and pray it makes money and VCs will pay you. I don’t understand that mentality; we’ve been doing this for a long time.</p><p>I am definitely the first one out the door to use AI and LLMs for things, but it doesn’t mean I’m gonna let it drive my car or care for my kid. I want some structure and controls around it, no different than a human, right? And I think a lot of people poo-poo the idea of treating LLMs as human, making them human-like. But I think it’s a very good analogy, and it corroborates my feeling towards vibe coding, which is, why in the fuck would you approach an engineer and just say, build me a website, and walk away and think it’s just gonna work?</p><p>Like, they’re gonna build something.</p><p><strong>Shane</strong>: It’s even worse now though, because you read people saying, my boss vibe coded over the weekend and gave me the code and told me to push it to production. Like, there’s a problem.</p><p><strong>JT</strong>: I have a feeling that’s clickbait though.</p><p><strong>Shane</strong>: Yeah, probably. Although I’ve met some managers. One of the questions I’ve got around that Deloitte thing, though, was whether the first prompt in their agent is always telling it the latest shape to use in the document, like, are pyramids in this week, or circles, or matrixes, because you gotta change the shapes in every document. That was dark as.</p><p>And by the way,</p><p><strong>JT</strong>: That’s where all the money comes from.</p><p><strong>Shane</strong>: the interesting thing is this idea of maker-checker. Is that what you called it?</p><p><strong>JT</strong>: Maker-checker, yeah. It’s a process paradigm.</p><p><strong>Shane</strong>: Yeah. It’s around factorization. It’s about repeatability. It’s around being deterministic.</p><p>And then we have artists, we have craftspeople that make things that are just one and done. They make it once and it is not deterministic.</p><p>It is probabilistic. It is a piece of furniture.</p><p><strong>JT</strong>: I’m gonna fight you on this. I’m gonna fight you on this. I agree with you that if I am painting a painting, it is easy to think about it as: I’m painting the painting, I paint it, and it’s over. 
But I don’t know if you do any art, Shane, but I cook, so I’ll relate this to cooking. Do you cook?</p><p><strong>Shane</strong>: Yes.</p><p><strong>JT</strong>: Do you taste your food while you’re cooking?</p><p><strong>Shane</strong>: Yes.</p><p><strong>JT</strong>: Good. That is a good thing you should do. It’s not exactly maker-checker, but if you have a partner or someone you’re cooking for, sometimes you have them taste it. It doesn’t mean you have to have them check it just at the end, when you’d get to redo the whole fucking thing, but at least there’s some process that ensures you are not going off the rails entirely. I make a lot of random stuff. A lot of the stuff I make I’ve never made before, just because it’s fun, and I always taste it midway through, because you never know what might go wrong, or there are little tweaks: more salty, more spicy, too spicy, that kind of stuff. But the maker-checker paradigm, that is a very particular paradigm. There are probably some ways that pattern is implemented where it’s, you finish your thing, give it to me, I review, say whether you suck or not, hand it back to you. There might also be versions of that where it’s at mid points, right? Like</p><p><strong>Shane</strong>: Okay, so let’s take that maker-checker idea and say that in the process, even an artist, a craftsperson, checks it themselves, right? They may have another person that’s as experienced as them, but they are checking, right? They’re always checking their work.</p><p><strong>JT</strong>: Hopefully, to some degree.</p><p><strong>Shane</strong>: And within gen AI, I have three types.</p><p>I have what I call ask AI, which is where you ask it a question, you get back a response, you ask a question and you’re chatting with it. And then you go off and make the decision as a human, right? And ideally get another human to check your work. Assisted AI is where it’s watching what you are doing and it’s going, based on what I know, you might wanna think about this,</p><p>so it’s prompting you. You can listen to it, you can ignore it, but you carry on, you finish that task. And automated AI is where the machine does it and no human’s ever involved. It just happens. And so if we take that idea of PDFs, and if we think about code being deterministic and LLMs being probabilistic, and we think about if I wanted to just upload a PDF and get some tribal knowledge back about it: that is a probabilistic problem.</p><p>I can put it up there, it’ll give me some stuff, I’m in the loop, I’m gonna make some decisions. So therefore it’s an ask kind of feature, and the blast radius of me getting it wrong lives with me, and I am my own maker-checker paradigm. If I was automating that PDF to go into my finance system and put in the number, then maybe I move to an assisted model. I upload it, the machine identifies all the fields for me, it comes back and goes, this is the invoice number, this is the tax amount, this is the total amount, this is the supplier.</p><p>You happy? Click go. So it’s a system, it’s automating all that OCR-y shit, but I’m still making the final call of, yeah, that’s right.</p><p>Versus if I take it to fully automated: that’s when I’m dropping in a thousand invoices, they’re going into my finance system, and a payment is being made, and I am not in that loop.</p><p>At that stage. 
In my head, I go back to code, I go back to deterministic code that is looking at specific places on the layer of that invoice and saying that is the number, and if there&#8217;s no number there, don&#8217;t take it from anywhere else. And so I would say at the moment, I would not use an automated gen AI solution in that use case.</p><p>That&#8217;s only because I haven&#8217;t tried it lately. Like you said, when I uploaded images two years ago, it sucked. I upload them a year ago. It&#8217;s got amazing. I haven&#8217;t, done it this week. It&#8217;s probably gonna freak me out. So where would you sit, right? When do you jump from assisted human in the loop?</p><p>Make a checker to let the bloody thing take this unstructured or semi-structured data and just human out of the loop.</p><p><strong>JT</strong>: I&#8217;ll say a couple things. One, there&#8217;s the very obvious part that I&#8217;m gonna state because everybody&#8217;s gonna say crap about this, but security, obviously there&#8217;s a security element to this. We&#8217;re talking about financial statements, blah, blah, blah. Let&#8217;s remove that just for argument&#8217;s sake. So I&#8217;m gonna repeat again, we are removing the security concern here and the data privacy and all that shit, just to have a hypothetical conversation before people are like, eh, privacy, blah, blah, blah.</p><p><strong>Shane</strong>: But hold on. What&#8217;s your definition of security?</p><p><strong>JT</strong>: ask you, go call, call. I&#8217;m gonna give you Joe Reese&#8217;s phone number. You can call him. He loves to debate semantics. Anyways this is gonna sound funny. I don&#8217;t know where that line is, and I am actively and consistently trying to do it the automated way as much as possible, and I often equate a lot of these things to like meditation, right? This is all about building good practices. You have to set up the conventions in your mind, build the muscle memory to do the thing that you may not have done before. I&#8217;m with you it&#8217;s very easy for us as engineers to think about Hey, I wanna rip this one cell.</p><p>I know it&#8217;s in the same place every time off this document. Write the code to do it right? And don&#8217;t get me wrong, there&#8217;s an over-engineering element to this of throwing an LLM at a problem like that is definitely a bazooka. At an anthill, like a hundred percent. There&#8217;s also a time to market, I&#8217;ll call it component of this, which is how fast can you write that code compared to how fast I can go to a website and upload a file and ask a question.</p><p>I&#8217;ll quick draw with you and I&#8217;m willing to bet that I&#8217;ll win. And I&#8217;ll still get the same answer, right? And then there&#8217;s a middle ground if you really wanna fuck around, which is have the l lm write the code to do the deterministic thing. That&#8217;s a whole nother like I don&#8217;t have to have the LLM just do the work. I can have the LLM write the process to do the work. And then you get a little bit of, a little bit of both, Because it&#8217;s a deterministic process that was generated very fast. The, this whole name of the game is speed, how fast can I do X activity?</p><p>That&#8217;s our North Star. If we&#8217;re talking about expense parsing I literally just did this the other day, right? I dropped a PDF onto a platform and it read in the expense. Cool. Did I validate it? I didn&#8217;t actually, that&#8217;s not entirely true. 
I knew what the number was before I dropped it in and it happened to produce the right number.</p><p>And I was like, cool. So that&#8217;s my pseudo maker checker. Most of these platforms these days. And you made a comment before about. Trust. And trust and determinism, trust and code, right? Especially when we think about people. A lot of that trust is just based on transparency, Transparency, auditability, the ability to go back, And this is a very common paradigms in data. Like, how do I roll something back? How do I undo something? We were talking about this yesterday or the other day, right? Especially, from a version control has given us this wonderful sense of security. I can go back, I know what happened, blah, blah, blah, blah, blah, I can yell at Shane &#8216;cause Shane fucked it up. So we have blames. I think that, in this world where an LLM can do the work for you, again, from a cutting down time perspective, it does cut down the time. Maybe I put the document side by side with it, which is very common, of the old platforms and the new ones, Great. Look at the document. Here&#8217;s the value, here&#8217;s where the value came from now. Looking at a form, let&#8217;s just make this more complicated. &#8216;cause it&#8217;s fun, If it&#8217;s a financial statement, And there are a bunch of companies out there that just straight do this.</p><p>This is their business, right? Talking about AI data services. Their whole job is to take documents and PowerPoints, I won&#8217;t name names, but documents from investment funds as an example and pull off the values. Now there&#8217;s definitely an intelligent person out there saying, why the fuck don&#8217;t they just put the values out on an API?</p><p>And that&#8217;s a great question, but that would be logical and God knows none of this shit makes any sense. I know a lot of them put out these documents and they probably got fucking pictures on it and all kinds of stuff, and they&#8217;re, and hell, if they&#8217;re all the same, they&#8217;re definitely all different because why would they be the same?</p><p>That would make sense. there&#8217;s whole businesses centered around just ripping these documents, doing the OCRE, whatever. It&#8217;s one thing if it&#8217;s a table. And this kind of goes to the facts that Bloomberg stuff too. If it&#8217;s a table and you&#8217;re just like, I always want sell one, column one, row one, give me the number every time. Not super complicated, not a lot of assumptions that need to be made. And I think assumptions is one of the big things I think about when I think about LLMs and probabilistic patterns and things like that. And I can give you my convention there, but the number of assumptions it needs to make. Also relates to how much context you give it, how clearly you can describe the things it needs to do. So my usual grid of this is on one side it&#8217;s the number of decisions that you&#8217;d need to make. And then the other side is how much information you&#8217;ve given it, That&#8217;s any process.</p><p>It&#8217;s straight up any process. it&#8217;s for a human or a machine or anything at all. I could say, Shane, go make me a cake. I&#8217;ve given you no information. You have to make shit loads of assumption. You could make a cake out of mud and theoretically you&#8217;ve delivered, you&#8217;ve given me a cake, Or you could have made me a Lego cake for all I care.</p><p>That will also suffice, But that doesn&#8217;t necessarily meet what I have for expectations. 
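<p><em>To make that contrast a little more concrete, here is a rough Python sketch of the two extremes being described: a deterministic lookup that always reads the same cell of the same sheet, versus handing the whole document to a model and asking for the number. The openpyxl call is a real library call; the sheet name, the cell reference and the call_llm helper are illustrative stand-ins, not anything specific from the episode.</em></p><pre><code># Deterministic: the value is always in the same cell of the same sheet.
from openpyxl import load_workbook

def revenue_from_workbook(path):
    wb = load_workbook(path, data_only=True)   # read cached values, not formulas
    return wb["Income Statement"]["B2"].value  # fixed sheet, fixed cell, every time

# Probabilistic: give the whole document to a model and ask for the number.
# call_llm is a placeholder for whichever model API you actually use.
def revenue_from_llm(document_text, call_llm):
    prompt = (
        "Extract total revenue from the following financial statement. "
        "Apply any adjustments described in the footnotes. "
        "Reply with a single number and nothing else.\n\n" + document_text
    )
    answer = call_llm(prompt)
    return float(answer.strip().replace(",", ""))
</code></pre><p><em>The first version is cheap and repeatable but breaks the moment the layout changes; the second tolerates messy layouts but needs the review and checking discussed above.</em></p>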
So again long-winded piece here, but. If we did something more complex from a document processing perspective, A, hey, give me the revenue and the revenue&#8217;s in the cell, but then there&#8217;s four adjustments in footnotes and 12 other little things that you need to take into consideration. that&#8217;s how it gets complicated real fast. And we know in financial statements, this happens all the time. Like accounting for a lot of people seems like it&#8217;s a very route thing, and it is actually way more creative than you realize. So yeah, that&#8217;s where this stuff gets creative.</p><p>So to my process and doing things like this I always start small tasks. I always try to automate it if I can, if I&#8217;m comfortable with the security, et cetera. I always try. I have to start there to understand the bounds of what can be done or not done. And sometimes it&#8217;s process architecture to repeat that, or sometimes it&#8217;s how much information, am I clearly articulating the instructions, which is prompt engineering for lack of a better term, which is for humans, just fucking communication, which is always comical to me. I think that there&#8217;s no balance in some respect. You should always be trying to see what it can do again, I don&#8217;t think we&#8217;ve fully adapted to understanding I can use this every day. And that&#8217;s why I think it&#8217;s helpful for a lot of us to think of them as humans because it&#8217;s just oh, if I had an assistant, you&#8217;d know instantly what you&#8217;d have it do short of pick up my laundry, which it can&#8217;t do.</p><p><strong>Shane</strong>: I think there&#8217;s there&#8217;s an interesting kind of thread in there that I&#8217;ve been thinking about, , and I&#8217;ll just unpick it. When we built our data platform, because we&#8217;ve been building data platforms for years as consultants, there were a bunch of core patterns that we knew were valuable.</p><p>And because we pay for the cloud infrastructure and cost, not our customer, we are hyperfocused on cost of that Google Cloud stuff, because it&#8217;s our money. We log everything. . Every piece of code that runs, we log the code that was run, how long it took you, we have this basically big piece of logging where we can go back and ask questions.</p><p>You, oh, are we seeing a spike on the service? Which customer you or customers is it one customer? Is it a volume problem? Is it a seasonal thing? Is it across all customers and Google are changing their pricing model, which they do on a regular basis. And so that&#8217;s just baked into our DNA. Yeah. Log the shit out of everything.</p><p>&#8216;cause at some stage we&#8217;re gonna have to go and ask a question of those logs and we need that data. When we started moving into, AIing with our agent 80 we logged some stuff, but not everything. And as we saw our partners start to use her for really interesting use cases, they started saying, here&#8217;s some source data.</p><p>What kind of data model should I start modeling out? &#8216;cause we haven&#8217;t seen it. They have a piece of data and they wanna transform it. And so they&#8217;re saying, what we call transformations, we call &#8216;em change rules. What should the change rule look like? Talk me through how to create it.</p><p>And we just we saw those questions happening &#8216;cause we were logging the questions and, we started doing more context engineering to give her better access to things that gave her better answers. 
And what we didn’t do was, when we moved to the, and air quotes here, reasoning models,</p><p>we didn’t ask her to log the reasoning path.</p><p>Every time we asked her a question ourselves when we were testing something, we would ask a second question straight away of, how did you get that? Because we wanna understand where she’s grabbing the context, the prompt, from, ’cause that’s what we wanted to tweak.</p><p>And what was interesting is, our principle was log everything up the kazoo, and yet we move into this AI world and we didn’t do it. And of course, once we saw our partners asking questions, we’re like, how the hell did she get that answer?</p><p>’Cause it’s not the one we wanted her to have. And then we ask her the same question and we get a completely different answer. We’re like, okay. And so then what do we do? We just put into the prompt, or into the framework effectively, log the path for every question you answer. And now we have a richness.</p><p>So that’s the first thing: it’s really weird how, as data professionals, we have these baked-in principles and patterns for our entire life, and then as soon as we move into this AI world, somehow we forget what we always do. The second point’s around complexity. So one of the things I do when I’m coaching teams is I will ask them to draw me a map.</p><p>Draw me a map of your architecture, draw me a map of your workflow, draw me a map of your data sources. Just draw me maps, because I’m a visual person. And it’s essentially what you say about giving an LLM a layout, a map, and distance between things, similarity or clustering of things.</p><p>They are visual indicators that, as humans, we’re really good at using.</p><p><strong>Shane</strong>: And so when I’m working with teams, the reason I want a map is that the number of nodes and links, the number of boxes, the size of the map will gimme an instant identification or understanding of complexity. You have 115 source systems there that are going through 5,000 dbt pipelines?</p><p>I’m gonna go, that is a complex problem. You show me your team structure and it’s got four layers of teams all handing off to each other, and they’re in pods and squads and there’s 150 of them? You’ve got a complex business organization and team topology.</p><p>And so what’s interesting, because I was just thinking about it as you were talking, is if I take those reasoning paths and I basically do some simple statistics, I can cluster two things. How many steps did it take: that will infer the complexity of the context it’s trying to use and the task it’s trying to do.</p><p>And the second one is clustering around reusable paths, where it’s constantly doing the same thing. That means it is almost deterministic behavior, because it’s constantly going through the same path. Where we see an edge case, an outlier, where it’s gone through a completely different path, we’re like, ooh, is that because it’s a different question? Did you just go and hallucinate for some weird-ass reason? Or, yeah, there’s something interesting there because it’s different, and we know different has value. Which comes back to one of the things around evals. So there’s lots of work, and it’s a new hot thing in the market, on how do I eval my models.</p><p>Yeah.</p><p>We used to call it tests, right? 
And, we know what data people like testing go back to my point of principles and patterns that we apply for our data work every day. And then in the AI world, we don&#8217;t, let&#8217;s talk about the ones we never apply. And now you get this whole idea of judging, So the idea of, if I asked the LLM the same question four times, and then I go and determine whether the answers are similar, and if they are the same or very close, then I should have more trust in the deterministic capability of that answer.</p><p>And that&#8217;s a presumably, I haven&#8217;t actually bothered to go and see any research papers to say if that&#8217;s true or not.</p><p>But is that what you do? Do you tend to use that as an eval process when you are doing all this work to apply AI to reduce your effort?</p><p><strong>JT</strong>: Yeah. I always try and set up test harnesses of whatever kind, and I put evals in that same bucket. I agree. And, in. Our book club, We&#8217;re reading AI Engineering by Chip, which is a great book, talking all about the different evaluation methods and types, et cetera. I agree with the fact that I think this is a under invested area. I think it is a extraordinarily important area. And I equate it to testing in code, I think that there&#8217;s, a ton of code out there that isn&#8217;t tested for whatever ridiculous reason, And it&#8217;s always, people wanna move fast.</p><p>There&#8217;s businesses, blah, blah, blah. But, even if that means ad hoc testing, which is what you&#8217;re noting right now, run it a couple times and see the answer, that&#8217;s better than nothing. I&#8217;ve done a variety of methods now. I do LLM as a judge. I am still back and forth on this. So short of this, if you&#8217;re unfamiliar for anybody else it, it&#8217;s effectively, kinda like an adversarial, I have a system that generates something and then I have another LLM that more or less evaluates it.</p><p>I&#8217;ve done it with different LLMs evaluating others. So I might have Gemini as my model and I&#8217;ll use Claude as my evaluator or multiple evaluators, things like that. Or have it asked different questions from an evaluation standpoint, I&#8217;ve done that a variety of different ways.</p><p>That&#8217;s been very useful. I think anything where you can run your tests. At scale. Scale might be, that&#8217;s a very overloaded term. A any, anything where you can run a lot of tests, not spot tests, A lot of tests and a lot of extreme tests, I always tell people the QA joke about the bar, you know that joke, right? </p><p><strong>Shane</strong>: I may do, but tell it in </p><p><strong>JT</strong>: I&#8217;ll tell it anyways. I&#8217;m gonna butcher it. But the joke goes something like, somebody builds a bar, qa tester tests the bar orders one beer order, 99 beers, order a million beers, everything works fine. First guy in the door says, where&#8217;s the bathroom bar explodes? Great joke. Very appropriate in this circumstance, right? Of you don&#8217;t know what people are gonna ask. And I understand that large language models obviously are an NLP thing. It works on natural language, blah, blah, blah, blah. There&#8217;s many things in that case that are open-ended, so to say. And I think that the chat paradigm, especially as a UX paradigm, introduces a wild world of open-endedness, That, and anything can happen, And I talk to people that are building, again, AI products, and I&#8217;m like, do you want me to come to your product and ask you what my favorite color is? 
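<p><em>As a loose illustration of the eval ideas in this exchange (the run-it-several-times consistency check Shane describes, plus the LLM-as-judge pass JT mentions), here is a minimal Python sketch. The ask_model and judge_model callables and the 0.75 agreement threshold are illustrative assumptions, not anything from the episode or a specific vendor API.</em></p><pre><code>from collections import Counter

def consistency_check(question, ask_model, runs=4):
    """Ask the same question several times; high agreement earns more trust."""
    answers = [ask_model(question) for _ in range(runs)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / runs            # answer plus agreement ratio

def judge(question, answer, judge_model):
    """LLM-as-judge: a second model reviews the first model's answer."""
    prompt = (
        "You are reviewing another model's answer.\n"
        "Question: " + question + "\nAnswer: " + answer + "\n"
        "Reply PASS if the answer is plausible and on-topic, otherwise FAIL."
    )
    return judge_model(prompt).strip().upper().startswith("PASS")

def evaluate(question, ask_model, judge_model):
    answer, agreement = consistency_check(question, ask_model)
    passed = judge(question, answer, judge_model)
    ok = passed and agreement >= 0.75          # arbitrary threshold; tune per use case
    return {"answer": answer, "agreement": agreement,
            "judge_pass": passed, "needs_review": not ok}
</code></pre><p><em>Anything that disagrees with itself or fails the judge gets flagged for a human, which is the maker-checker structure the conversation keeps returning to.</em></p>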
Because what what&#8217;s your thing gonna do then? You may have built this wonderful chat bot for finance or something, and I&#8217;m gonna walk up to it and be like, how do I make a chocolate souffle? And it might gave me a wrong answer. I might be like, this AI is terrible. I walked up to my finance AI and asked it where to drive to for lunch and it didn&#8217;t work. I was like no shit, but there&#8217;s a piece of that. I was just like, it&#8217;s not intended to work. So the open-endedness is pretty wild. But I do appreciate, there&#8217;s a bit of, this was just like almost chaos engineering, which is just I just wanna let it like just blast it with like random stuff. I also think that, having a prebuilt dataset, you should have some tests almost like unit tests to some degree. Ask it a question, you should get an answer. Then the next problem to your point is how do I evaluate whether the answer&#8217;s correct or not? And that, that is hard in and of itself. But there are conventions, ways you look for certain keywords, values, things like that. So I do think that evaluation&#8217;s massively important. It&#8217;s very easy to spot check a couple questions and just be like, yeah, it&#8217;s good. Move on with your life. And then you&#8217;re handing people free cars because they fucked with your chat bot, So it&#8217;s massively important, And I think that the more we can in a manner I speaking, beat the shit out of the machine and like literally test it to the nth degree. Even if you are using a language model, I think that&#8217;s helpful because the language model&#8217;s gonna come up with shit you probably didn&#8217;t even think of, like jack the temperature up and just tell it, ask random questions and make sure it doesn&#8217;t ask the same question twice through all I care and then you, this is our make checker a little bit can go review and say, Hey, this make any fucking sense. Or did it just spit out random garbage? I think all that&#8217;s super important. Like evaluation I think is very understated right now.</p><p>I think people are catching on a little bit, but. You said it before it&#8217;s very comical right now that we&#8217;ve got this new toy and everybody just forgets who the fuck they are and where they are, and they&#8217;re just like, yeah, cool. Let&#8217;s do it. You&#8217;re just you&#8217;re,</p><p>Yeah. You&#8217;re a seasoned professional. </p><p><strong>Shane</strong>: Ah, And an organization. I was talking to somebody the other day and they&#8217;re in a large organization where security is key. And then they were saying that their cloud provider, somehow the LLM part of the platform got turned on for the whole organization.</p><p><strong>JT</strong>: Ooh, fun.</p><p><strong>Shane</strong>: Wow. And so it&#8217;s interesting how this AI thing seems to change, behavior.</p><p>People, you  and again, </p><p><strong>JT</strong>: that case might be a mistake, just like a human error.</p><p><strong>Shane</strong>: yeah, maybe. But you do see, organizations that care about security and governance and privacy, and then you see people in those organizations grabbing things and putting them in their personal LLM. cause it&#8217;s so easy now to take a screenshot, and yes, they&#8217;re doing the wrong thing, but it&#8217;s amazing how human behavior is so different. when we talk about evals, we&#8217;ve gotta go back to complexity. Let&#8217;s just go back to data. 
If I have a piece of data and I have a piece of code and I have a, an assumption or an assertion I can define what the assertion is.</p><p>I can run that code against that data and the data doesn&#8217;t change, the code doesn&#8217;t change, and it can tell me whether that assertion is right or wrong. That&#8217;s three moving parts in that test suite. Why we don&#8217;t do that a lot really intrigues me. Now let&#8217;s talk about that within Gen ai.</p><p>Oh, actually no hold on. I&#8217;ve got. The data or the thing that I&#8217;m using, right? The input, PDF piece of data, I&#8217;ve got the question that I ask it, which is effectively the code to a degree and I&#8217;ve got my answer, my assertion that I expect. Ah, but actually I&#8217;ve got an LLM model, right? Which may or may not have been updated by the vendor without me knowing about it.</p><p>Ah, I&#8217;ve got a prompt. I&#8217;ve got a piece of text that&#8217;s actually embedded in that process that may or may not have changed or be interpreted differently. Ah, I&#8217;ve got rag or context, I&#8217;m pushing at some other stuff. That may or may not have changed. Ah, the PDF&#8217;s exactly the same except the dollar value for the invoice moved down 25 pixels.</p><p>Now that&#8217;s not gonna make a difference or does it? This is where we get to is that actually now if you do nodes and links for the process, we have lots more of them. And therefore, the complexity of what we&#8217;re testing the things that can change is massive. And talking about change again, one of the human natures I&#8217;ve found is once I start using a model or a tool, I get stuck.</p><p>I use chat gpt a lot for writing, because that&#8217;s what I started out with. I use perplexity for searching now. Over Google. So I perplexity. Now I don&#8217;t Google. We use Gemini for our platform because we&#8217;re bound to the Google Cloud. I use Claude a little bit for MCP testing, but I don&#8217;t use it a lot because I don&#8217;t code and I&#8217;m sitting there going, I can&#8217;t remember the last time I actually tried to decide when I would work out whether there are better models for what I want to do.</p><p>For the thing that I use chat GPT for. So how do you deal with that? Given the models are changing all the time and that the models are, tend to be good at specific tasks how often do you reevaluate the tool or model that you are using in the work that you do?</p><p><strong>JT</strong>: not as much as I think I should. But I evaluate the new ones as they come out. So , when GPT five came out, like I reactivated my open AI subscription, I really don&#8217;t like open ai. I don&#8217;t know what it is. I maybe they, I just like, feel like they&#8217;re like evil empire or something.</p><p>I, like everybody else started using their products when they came to market and, even before chat days. And I&#8217;m not gonna be that guy that&#8217;s just oh, I&#8217;ve been doing this before. But I used that stuff before and it was interesting and I was very curious about it, especially given like the work that I do. when, Chachi PT came out, I was definitely all over it and I was very fond of it. So I will tell you that I very often use for understanding information. I use Gemini, I appreciate that one. I am a Claude fucking junkie when it comes especially to coding, And I&#8217;m a heavy cursor user, like very heavy cursor user. 
And I will tell you my favorite thing that has come out period, that no offense to you, Shane, but I have definitely done some work while we&#8217;ve been talking, right? And this is my favorite thing. I have my ticketing system wired up to background agents or cursor.</p><p>So like I write. Reasonably verbose tickets that explain exactly what I wanted to do. And for context, like the scope of these tickets is like, Hey, you were taking in parameters one, two, and three. I don&#8217;t want that. I want it as one parameter that looks like this. And then I want you to do this with that. That&#8217;s the scope of tickets that I&#8217;m writing because I want to be in control of it. And it&#8217;s like a micromanagement approach. I write the ticket, I say Go and it writes it. And I just keep on with my day. So I could be like walking my kid or my dog or somewhere and I&#8217;ll just fire fucking tickets off.</p><p>I&#8217;ll be working all the time. It&#8217;s amazing. That&#8217;s my favorite feature by like a absolute mile and a half. And I love Claude 4.5. I think it&#8217;s amazing. And the code quality&#8217;s phenomenal and I&#8217;m probably paying them a disgusting amount of money. But it&#8217;s, I think it&#8217;s totally worth it right now because I can move epically fast and multitask.</p><p>so Gemini for information, Claude for code open AI for nothing unless I like, just want an alternative opinion. And I definitely consider them all like different people to some degree. I very much think of them like people, they all have their different quirks .</p><p>Some things are good, some things are bad. Some things are like, especially from a code perspective, some of the models are more overeager than others. Some of them wanna try and cover more edge cases than others. And I, sometimes I&#8217;m just like, no, just fucking do what I told you to do. I don&#8217;t need you to do 20 other things.</p><p>And I have all kinds of rules set up. So I have it set up in a way that it will do the extra things I told it to do and not the things I don&#8217;t want to do to some degree, which is very nice. I&#8217;ll make a separate point. And I&#8217;ve always been this way for better or for worse, but a healthy degree of paranoia I think is always good as data professionals, like healthy paranoia. There is unhealthy paranoia, but healthy paranoia. I am a big believer and I&#8217;ve met many people on my career and I think a lot of us agree with that level of just distrust, inherent distrust, And it&#8217;s not to say that we don&#8217;t trust everybody. I actually think I&#8217;m a reasonably trusting person, which I&#8217;m sure someone could take advantage of me just hearing that. But I still look up the things that I asked the LLM to do, to put this into context. I still crosscheck things. When I write code with LLMs, I read all of it, like all of it, I see my role as I am the reviewer. I am the checker, I always joke with people, it writes faster than me. That&#8217;s why I use it. It types faster than me, but I am always reviewing what it did. If it does extra things, great brownie points.</p><p>Take some, extra credits so you can, whatever, I&#8217;m very content with that. But I absolutely read everything as much as humanly possible. And sure, I do random experiments where I just tell it, build me some shit and I don&#8217;t really care. And that&#8217;s a fun experiment. 
But I’ve worked on a project recently that I’ll tell you about where, without going into too much detail, the project wrote something in a language I’m less familiar with, and I read some of it and I definitely spent some time. I used an LLM to do this and to teach me about the other libraries, teach me about the other things that I’m just not as familiar with. So I spent some time doing that in different LLMs, different context windows. So hopefully there’s some, call it arb, that’s happening there where I’m hopefully getting the right information.</p><p>I also do Google things still. I don’t Perplexity everything, I think I Perplexity things just sometimes ’cause I wanna see what the answer is. But I do still Google shit. ’Cause I’m not gonna trust it unless, like, if I asked it to write something in Beam, I better, I’m gonna go to the fucking Beam website and check to see that it did it right.</p><p>And especially from a code perspective. And they may have fixed this, but my favorite, like, red flag reading somebody else’s generated code. One of my favorite red flags. There’s plenty of them. The comments and all this other shit. My favorite one for a long time was that, especially when you had it generate an AI model or something that used the language model, it usually wasn’t trained on enough information to know what the latest model was.</p><p>So it was always a really big dead giveaway when you generated code to use, like, Gemini, and it puts 1.5 in there. And I’m like, you didn’t read this. I know you didn’t read this, because it’s using a model that we all know is a year or two old now, but you didn’t spend any time just looking. That’s a really basic thing to look at. And you didn’t even look at it. I usually give demerits to that person. That’s a very glaring thing for me, because it’s just: read the code. It’s not that hard, I promise. Even if you skim it, it’s not that bad. It’s better to do that.</p><p>So to the point of do I evaluate the tools, do I do that? Yeah, I absolutely still do. I definitely still spend time learning about the stuff outside of my use of it as well. So that way, again, same as a person. Even if you meet the smartest person on the planet, I still wanna have my own authority.</p><p>I still want to fact-check people. Maybe there’s a very existential, philosophical comment here about fake news and the internet, or the web, sorry Juan, the web. So I’m a big proponent of cross-validating things. I just believe that, and sure, not everybody has time to do shit like that.</p><p>I definitely don’t do it for everything. There’s some things I just take at face value, but I do believe in that. And I have a little kid right now and I’m in the phase of why, which I love, I really love this phase. I’m sorry, I think my wife hates it, but I love it, because he asks why. He’s not at the annoying why phase, but he asks why for a lot of things. And I take it as an opportunity to go look them up, and we talk about things. And if he asks a subsequent why, I’ll look that up too. I don’t give a fuck. I love looking things up. But you should feel more empowered to do so. And I feel like people just have short attention spans.
They’re like, they just don’t fucking care.</p><p><strong>Shane</strong>: I think you’re right. I think that idea of trust but verify, and, coming back to your Gemini 1.5, they probably don’t know there is a Gemini 2.5 because they haven’t done the work. They haven’t done the reps to become an expert, or even experienced in it, to know that actually there’s a difference between 2.5 Pro and 2.5 Flash.</p><p><strong>JT</strong>: Yep.</p><p>Or Flash-Lite, or there’s Flash with dates after them.</p><p>Like, there’s all different fucking shit.</p><p><strong>Shane</strong>: Yeah. So, just looking at time, I just wanna close out with two questions for you. So right at the beginning we were talking about the lawyer stuff, and expertise and tribal knowledge. And given that all the LLMs have gone and stolen tribal knowledge that was publicly available and often privately available, do you reckon the world’s gonna move, that humans who have tribal knowledge are gonna stop documenting it in a way that it can be found?</p><p>Because actually the value now is tribal knowledge.</p><p><strong>JT</strong>: I wanna say no from like a humanity standpoint, but I do think the world will become a lot more polarized. I’m gonna explain this in a weird way, and I’m not intending it. I know that there’s a political backdrop and everything as we’re recording this, but if the model inherently gravitates towards the mean, always gravitates towards stereotypes, that kind of thing, and we are having a feedback loop to ourselves on this, our own opinions are going to be potentially squeezed, like kurtosis and all that fun stuff, if you wanna get into that. Do I think that people will stop? I think people will use these tools in more automated fashions and then that will create its own, probably problematic, feedback loop. I do, however, think that the pendulum will inevitably swing the other way, where I think people will still use these tools, still use tech to create content that maybe they didn’t record before. I don’t have a Substack or any of this stuff. Yes, I’m talking to somebody about recording a podcast, which is, why the fuck not, even though we’re on one now. It’s mostly from an I-wanna-help-their-business kind of thing.</p><p>But I think that hopefully technology will lower the barriers to entry for recording more. I hope that inevitably forces the pendulum to swing the other way at some point in time. And I remember reading an article recently, something about someone saying how there’s no second internet, there’s no second web. Really, again, I apologize, I like Juan’s perspective on this. It is the web, not the internet. And I actually make a mental note, like, always, to correct myself. So thank you Juan for that. But there’s no second web, like we’ve trained these things on so much information. At some point in time I do think we’ll start recording more, and then there’s a very interesting dynamic of, okay, now everything’s recorded. To your point, tribal knowledge and all. What’s left?
I’d like, and maybe this is overly optimistic and potentially naive, but I’d like to think that if we continue to use these tools in whatever capacity, we’ll advance our knowledge faster, and therefore there will be more knowledge and more interesting ideas and creative ideas. And I’ll emphasize something different that is related to this question, which is: I am a huge proponent of the fact that, especially as the pendulum swings back and forth, I am very bullish on the arts, to your point about painting earlier. Art is inherently creative. It’s inherently random. It’s inherently wild. There’s no predicting it. Yes, you can generate pictures and stuff.</p><p>We all know how shitty those come out. And some of them come out great, but for better or for worse, art is something I don’t think a lot of us understand. We still don’t understand how we work. And I think that in comparison to science, and I apologize to anybody that’s in science, I’m not trying to demean your field, but I’m gonna call sciences things that are rote in some nature.</p><p>They have predictive processes, end results, so on and so forth. So I’m very bullish on art, and I think the sciences, quote unquote, are gonna be marginalized away, and that’s part of this whole cycle in knowledge, et cetera, et cetera, as it learns more. But I do hope that technology encourages more people to maybe just do random shit, write down more, write down random ideas, write down everything. I don’t think that we’ve gotten to a point where all thought has been explored, right? Or all possibilities of thought are captured. That’s what we’re talking about to some degree.</p><p>Have you seen Jepson stuff?</p><p><strong>Shane</strong>: Nah.</p><p><strong>JT</strong>: Look, okay. I will plug Vox right now, because it’s one of those things that you watch and you’re like, whoa, this is cool, because he’s trying to replicate, like, evolution, an evolutionary thought, as part of his process. And it’s like, I’m sorry Jepson if you’re hearing this, but it’s like a very sophisticated Monte Carlo in some respects. It’s try everything. Or random forest, whatever you wanna say. And it’s way more sophisticated than that, I’m generalizing here, oversimplifying, but I do think that there’s something that we have that the machine doesn’t yet in its creativity, and that’s art. And I think that’s, and I hope, that there’s a pendulum swing. And don’t get me wrong, I think that the vast majority of society is gonna gravitate towards the mean, and it’s gonna be some massive enshittification of fucking everything.</p><p>But I do think that for people that are creative thinkers like that, this is the time.</p><p><strong>Shane</strong>: I agree. I think about data products without defining them, or defining the definition of defining. And I’ve reviewed two or three books that are currently being drafted around that. Some are what I would call bringing product thinking to data, and some are around a thing called data product.</p><p>That’s how I categorize them after reading the draft content. I know a couple of other people that are starting to write something around product thinking with data or data products as a thing. And my view is good.
The fact that there’ll be five or 10 books in a similar subject space is fine, because the way they’re telling the stories is different.</p><p>The storytelling is different for every book, even though it is relatively in the same domain or subject.</p><p>And that’s great. ’Cause what that means is, they can write them ’cause writing is easier now, and they can publish them because publishing is easier now. And then I can read them and I always pick up something new, something I didn’t know, something that entertains me or educates me.</p><p>And then I assimilate that into how I think. So I’m with you, push more out. But I think we will see some people try to put paywalls up, because that’s the trouble, that knowledge is how they make money. Interesting times. I think the other one, and we’re outta time for this one, but I think given the fact that tokens are subsidized so heavily at the moment, and we use the AI tools, or the GenAI tools, to automate stuff for us because it is faster and it is cheap enough that we don’t have to care, it’s gonna change. And when we get hit with the true cost of those things, then people who have built good automation, who have built things that are efficient and optimized and aren’t lazy processes or lazy code, they’re gonna be better.</p><p>And the people that just put a thousand PDFs in there with no context, no prompts, and get an answer, probably your product will be one of the many AI startups that die a thousand deaths.</p><p><strong>JT</strong>: Or maybe gets bought. </p><p><strong>Shane</strong>: Yeah. And then let’s say, yeah, there is a difference between being bought and being acqui-hired.</p><p><strong>JT</strong>: Valid. Valid.</p><p><strong>Shane</strong>: So yeah. Again, what’s your definition of buying? Or what’s your </p><p><strong>JT</strong>: Yeah. </p><p><strong>Shane</strong>: definition of different lighting? Alright. </p><p><strong>JT</strong>: You put a flame to the, yeah.</p><p><strong>Shane</strong>: Just to close it out, if people wanna find you and find what you are thinking, you’ve already said you don’t have a Substack, you might have a podcast.</p><p><strong>JT</strong>: Website. LinkedIn’s the easiest way to find me.</p><p><strong>Shane</strong>: LinkedIn and the Practical Data Discord. Because you are one of the more active people in the community. That’s how we met. </p><p><strong>JT</strong>: Correct. </p><p><strong>Shane</strong>: Join us. It’s free. </p><p><strong>JT</strong>: It’s a great, it’s a great community. I always advertise it to people. I’m like, yeah, who do you talk to about your business? Come talk to us.</p><p><strong>Shane</strong>: Yeah. And we will talk back.</p><p><strong>JT</strong>: Sometimes we might just tell you to fuck off, but that’s fine.</p><p>It’s also cool. </p><p>We’re very unhinged.</p><p><strong>Shane</strong>: You might get a meme. You’re gonna get a meme. </p><p><strong>JT</strong>: We’re a very unhinged group. </p><p><strong>Shane</strong>: Thanks for that. That’s been an interesting chat. We kinda went all over the place, but it’s good to talk to somebody that’s actually using AI to build data services in production, and monetizing it as much as possible, ’cause that is an art or a science, one of the two.</p><p><strong>JT</strong>: Art.</p><p><strong>Shane</strong>: Alright, excellent.
I hope everybody has a simply magical day. I.</p><h2>&#171;oo&#187;</h2><div class="pullquote"><p><em>Stakeholder - &#8220;Thats not what I wanted!&#8221; <br>Data Team - &#8220;But thats what you asked for!&#8221;</em></p></div><p>Struggling to gather data requirements and constantly hearing the conversation above?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Bu2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" width="387" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:387,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19725,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/160520537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Want to learn how to capture data and information requirements in a repeatable way so stakeholders love them and data teams can build from them, by using the Information Product Canvas.</p><p>Have I got the book for you!</p><p>Start your journey to a new Agile Data Way of Working.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://adiwow.com/168&quot;,&quot;text&quot;:&quot;Buy the Agile Data Guide now!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://adiwow.com/168"><span>Buy the Agile Data Guide now!</span></a></p><h2>&#171;oo&#187;</h2>]]></content:encoded></item><item><title><![CDATA[Can AI tools bring back data modeling with Andy Cutler ]]></title><description><![CDATA[AgileData Podcast #78]]></description><link>https://agiledata.info/p/can-ai-tools-bring-back-data-modeling</link><guid isPermaLink="false">https://agiledata.info/p/can-ai-tools-bring-back-data-modeling</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Tue, 23 Dec 2025 19:21:57 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/36b67043-cc5a-4414-9283-c8b776731cb1_800x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Andy Cutler about the art of data modeling and the potential of AI tools to improve the art.</p><blockquote><p><strong><a href="https://agiledata.substack.com/i/182448381/listen">Listen</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/182448381/google-notebooklm-mindmap">View MindMap</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/182448381/google-notebooklm-briefing">Read AI Summary</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/182448381/transcript">Read 
Transcript</a></strong></p></blockquote><p></p><h2>Listen</h2><p>Listen on all good podcast hosts or over at:</p><p><a href="https://podcast.agiledata.io/e/can-ai-tools-bring-back-data-modeling-with-andy-cutler-episode-78/">https://podcast.agiledata.io/e/can-ai-tools-bring-back-data-modeling-with-andy-cutler-episode-78/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://podcast.agiledata.io/e/can-ai-tools-bring-back-data-modeling-with-andy-cutler-episode-78/&quot;,&quot;text&quot;:&quot;Listen to the Agile Data Podcast Episode&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://podcast.agiledata.io/e/can-ai-tools-bring-back-data-modeling-with-andy-cutler-episode-78/"><span>Listen to the Agile Data Podcast Episode</span></a></p><p></p><blockquote><p><strong>Subscribe:</strong> <a href="https://podcasts.apple.com/nz/podcast/agiledata/id1456820781">Apple Podcast</a> | <a href="https://open.spotify.com/show/4wiQWj055HchKMxmYSKRIj">Spotify</a> | <a href="https://www.google.com/podcasts?feed=aHR0cHM6Ly9wb2RjYXN0LmFnaWxlZGF0YS5pby9mZWVkLnhtbA%3D%3D">Google Podcast </a>| <a href="https://music.amazon.com/podcasts/add0fc3f-ee5c-4227-bd28-35144d1bd9a6">Amazon Audible</a> | <a href="https://tunein.com/podcasts/Technology-Podcasts/AgileBI-p1214546/">TuneIn</a> | <a href="https://iheart.com/podcast/96630976">iHeartRadio</a> | <a href="https://player.fm/series/3347067">PlayerFM</a> | <a href="https://www.listennotes.com/podcasts/agiledata-agiledata-8ADKjli_fGx/">Listen Notes</a> | <a href="https://www.podchaser.com/podcasts/agiledata-822089">Podchaser</a> | <a href="https://www.deezer.com/en/show/5294327">Deezer</a> | <a href="https://podcastaddict.com/podcast/agiledata/4554760">Podcast Addict</a> |</p></blockquote><div id="youtube2--LXPaqBZoFo" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;-LXPaqBZoFo&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/-LXPaqBZoFo?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>You can get in touch with Andy via <a href="https://www.linkedin.com/in/andycutler/">LinkedIn</a> or over at <a href="https://linktr.ee/andycutler">https://linktr.ee/andycutler</a></p><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? 
The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><h2>Google NotebookLM Mindmap </h2><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fO4p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48fb47cd-a498-4d3e-ae5e-af9fbefaf315_4121x9737.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fO4p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48fb47cd-a498-4d3e-ae5e-af9fbefaf315_4121x9737.png 424w, https://substackcdn.com/image/fetch/$s_!fO4p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48fb47cd-a498-4d3e-ae5e-af9fbefaf315_4121x9737.png 848w, https://substackcdn.com/image/fetch/$s_!fO4p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48fb47cd-a498-4d3e-ae5e-af9fbefaf315_4121x9737.png 1272w, https://substackcdn.com/image/fetch/$s_!fO4p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48fb47cd-a498-4d3e-ae5e-af9fbefaf315_4121x9737.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fO4p!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48fb47cd-a498-4d3e-ae5e-af9fbefaf315_4121x9737.png" width="1200" height="2835.164835164835" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48fb47cd-a498-4d3e-ae5e-af9fbefaf315_4121x9737.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:3440,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:1988975,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/182448381?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48fb47cd-a498-4d3e-ae5e-af9fbefaf315_4121x9737.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fO4p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48fb47cd-a498-4d3e-ae5e-af9fbefaf315_4121x9737.png 424w, https://substackcdn.com/image/fetch/$s_!fO4p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48fb47cd-a498-4d3e-ae5e-af9fbefaf315_4121x9737.png 848w, 
https://substackcdn.com/image/fetch/$s_!fO4p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48fb47cd-a498-4d3e-ae5e-af9fbefaf315_4121x9737.png 1272w, https://substackcdn.com/image/fetch/$s_!fO4p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48fb47cd-a498-4d3e-ae5e-af9fbefaf315_4121x9737.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p></p><h2>Google NoteBookLM Briefing</h2><h2>Executive Summary</h2><p>This document synthesizes a discussion between data experts Shane Gibson and Andy Cutler, focusing on the persistent challenges and future direction of data modeling within the modern data landscape. The central argument is that while data technology has rapidly evolved toward accessible, powerful cloud platforms, the discipline of data modeling has been neglected, creating a significant knowledge and practice gap.</p><p>The conversation identifies a &#8220;repeated and constant battle&#8221; to prioritize modeling over the immediate appeal of technology, which provides instant feedback that modeling processes lack. This issue is compounded by a decline in traditional mentorship, where senior practitioners historically guided newcomers. Modern data platforms from major vendors like Microsoft, Snowflake, and Databricks are criticized for lacking integrated, opinionated tools that guide users through modeling processes, forcing practitioners to manually implement common patterns like Slowly Changing Dimensions (SCD) Type 2.</p><p>The primary conclusion is that Artificial Intelligence, particularly Large Language Models (LLMs), presents a transformative solution. AI is positioned not merely as a code generator but as a new form of mentor and assistant. It can educate novices, generate starter models, and, crucially, act as an &#8220;antagonistic&#8221; agent to stress-test models for future flexibility&#8212;replicating the critical feedback once provided by experienced data modelers. 
The effectiveness of these AI tools, however, hinges on providing them with opinionated constraints and clear business context to generate practical, fit-for-purpose models rather than theoretical, unimplementable ones.</p><h2>The Evolution of Data Platforms and Recurring Patterns</h2><p>The discussion begins by contextualizing the current data landscape within a 25-year evolution of technology. This history highlights a recurring cycle of platform development and a significant shift from capital-intensive, on-premises infrastructure to flexible, cloud-based services.</p><ul><li><p><strong>From On-Premises to Cloud:</strong> The journey is traced from early-2000s technologies like ColdFusion and SQL Server 2000, which required purchasing and managing physical hardware (e.g., &#8220;compact three eight sixes&#8221;), to the advent of the first cloud data warehouses like AWS Redshift, and finally to modern platforms like Snowflake, Databricks, and Microsoft Fabric.</p></li><li><p><strong>Democratization of Compute:</strong> This shift democratized access to powerful computing resources, moving from multi-thousand-dollar hardware purchases to pay-as-you-go cloud services.</p></li><li><p><strong>Recurring Cycles:</strong> A pattern is noted where the industry moves from installable software to pre-configured appliances and now to cloud-native databases. Despite these technological waves, fundamental challenges, particularly in data modeling, reappear. As Gibson notes, &#8220;every technology wave it seems to become hot and then cold.&#8221;</p></li></ul><h2>The Persistent Challenge of Data Modeling</h2><p>A core theme is the struggle to maintain the discipline of data modeling in the face of rapid technological advancement. It is often seen as a difficult, time-consuming process that lacks the immediate gratification of working with new tools.</p><ul><li><p><strong>A Constant Battle:</strong> Andy Cutler describes a &#8220;repeated and constant battle to make sure that data modeling is at the forefront of a data platform project.&#8221; He argues that modeling is frequently deprioritized in favor of focusing on technology.</p></li><li><p><strong>Architecture vs. Modeling:</strong> A common point of confusion is the conflation of data architecture patterns with data modeling patterns. Cutler clarifies this distinction: &#8220;The architecture enables the modeling. The modeling is put over the architecture.&#8221; He notes that patterns like the Medallion Architecture are data layout patterns, not a substitute for disciplined modeling techniques like Kimball or Data Vault.</p></li><li><p><strong>The Lack of Instant Feedback:</strong> A key insight is that technology provides immediate, binary feedback (it works or it doesn&#8217;t), which is psychologically rewarding. Data modeling, in contrast, does not. As Gibson puts it, &#8220;I can&#8217;t get instantaneous feedback that my model is good or bad or right or wrong... a model that you&#8217;ve created six months, a year down the line when all of a sudden something happens... the model isn&#8217;t flexible enough.&#8221; This delayed feedback loop makes technology more appealing to practitioners.</p></li></ul><h2>The Decline of Mentorship and the Knowledge Gap</h2><p>The conversation highlights a critical loss of institutional knowledge transfer. 
As tools have become more accessible and projects faster-paced, the traditional mentorship structures that trained previous generations of data professionals have eroded.</p><ul><li><p><strong>The &#8220;Grumpy Old DBA&#8221;:</strong> Learning was often driven by experienced seniors, colloquially the &#8220;grumpy old DBA,&#8221; who provided critical feedback and guidance on performance, design, and best practices. This hierarchy of mentoring was essential on expensive projects where mistakes were costly.</p></li><li><p><strong>Erosion of Foundational Concepts:</strong> With modern, abstracted tools, new practitioners are often not exposed to foundational concepts. The example cited is a user asking, &#8220;what is data persistence?&#8221;&#8212;a concept ingrained in older professionals who used tools that required manual saving (e.g., pre-cloud Excel).</p></li><li><p><strong>Lack of Accessible Learning Resources:</strong> While foundational books from authors like Steve Hoberman and The Kimball Group still exist, formal courses and guided learning paths for modeling are less prevalent. Unless actively guided to these resources, newcomers may not discover them.</p></li></ul><h2>The Inadequacy of Modern Data Modeling Tools</h2><p>A significant contributor to the modeling gap is the lack of robust, integrated, and opinionated modeling tools within major data platforms.</p><ul><li><p><strong>Vendor Agnosticism:</strong> Vendors like Microsoft, Databricks, and Snowflake avoid baking specific modeling methodologies into their platforms. They provide a &#8220;canvas&#8221; and &#8220;paintbrush&#8221; but &#8220;don&#8217;t help you draw the picture.&#8221; This forces users to bring their own process and often use disconnected, third-party tools.</p></li><li><p><strong>The SCD Type 2 Example:</strong> The implementation of Slowly Changing Dimension (SCD) Type 2 is a prime example of a common, well-defined modeling pattern that largely lacks out-of-the-box support. Practitioners are still required to write custom code to handle historical tracking, even though it&#8217;s a fundamental requirement in dimensional modeling. Databricks (Delta Live Tables) and dbt (Snapshots) are noted as exceptions that offer some built-in functionality.</p></li><li><p><strong>From Conceptual to Physical:</strong> There is a lack of end-to-end tooling within platforms like Microsoft Fabric that facilitates the entire modeling lifecycle, from conceptual design through logical design to the automated generation of the physical model.</p></li></ul><h2>Artificial Intelligence as the Future of Data Modeling</h2><p>The discussion concludes that AI, particularly in the form of specialized LLMs, is poised to fill the void left by declining mentorship and inadequate tooling. AI can act as an expert assistant, a sounding board, and a critical partner throughout the modeling process.</p><ul><li><p><strong>AI as Educator and Mentor:</strong> For those new to the field, AI can act as a guide, explaining different modeling patterns (e.g., Dimensional, Data Vault, Third Normal Form) and helping to translate business requirements into an initial model. This helps bridge the knowledge gap. The tool Ellie AI is mentioned as a specific example of an LLM-powered tool focused on guiding users through data modeling.</p></li><li><p><strong>From Generation to Antagonism:</strong> The most powerful application of AI is not just in generating a model, but in stress-testing it. 
The concept of using an AI to be &#8220;antagonistic&#8221; is raised, where the user can prompt it to find weaknesses and potential future problems.</p></li><li><p><strong>The Power of Opinionated AI:</strong> An unconstrained LLM may default to the most prevalent pattern in its training data (likely Kimball modeling, due to the volume of public content). The true value emerges when the AI is given specific constraints and opinions. Key inputs that improve AI model generation include:</p><ul><li><p><strong>Source Context:</strong> Providing the AI with source schemas and metadata.</p></li><li><p><strong>Design Patterns:</strong> Instructing the AI to use a specific, opinionated modeling pattern (e.g., &#8220;concepts, details, and events&#8221;).</p></li><li><p><strong>Business Boundaries:</strong> Using artifacts like an &#8220;information product canvas&#8221; to define the specific business outcomes the model must support, preventing it from over-engineering.</p></li></ul></li><li><p><strong>Multi-Agent Approach:</strong> A proposed advanced approach involves using multiple AI agents with different perspectives (e.g., one focusing on source systems, one on business processes, one on reporting outcomes) and having them &#8220;antagonize each other&#8221; to arrive at an optimal, pragmatic model that balances all constraints. This mimics the cognitive process of an experienced human modeler.</p></li></ul><p></p><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><p></p><h2>Transcript</h2><p><strong>Shane</strong>: Welcome to the Agile Data Podcast. I&#8217;m Shane Gibson.</p><p><strong>Andy</strong>: And I&#8217;m Andy Cutler.</p><p><strong>Shane</strong>: Hey, Andy. Thanks for coming on the show. Today we&#8217;re gonna have an intriguing track around data modeling and whether it&#8217;s been done not being done, gonna be done and how tools and AI can fit into, it&#8217;s it&#8217;s a passion of mine. But before we jump into that conversation, why don&#8217;t you give a bit of background about yourself to the audience.</p><p><strong>Andy</strong>: Yeah, sure. Thanks Shane. so my name&#8217;s Andy Cutler, and I&#8217;ve been working. In the data space for about 25 years. So back in 1999 when I was working for a tiny little UK drum and bass record label called good Looking Records I was given the role of updating the company website and that ran on a technology called ColdFusion and I got quite into web design.</p><p>So I then managed to get myself a job, and I guess it was the first IT job that I had designing websites. Then that website agency bought a SQL Server 2000 license, and I know there&#8217;s older versions of sql. I&#8217;m, really aware of that. But that was the first version of SQL that I got my hands on, and I was tasked with building the data models and the store procedures to run CMS systems.</p><p>And then five years later, so mid two thousands, I am working in the data warehousing space. 
So I’d learned everything about database normalization, third normal form, so on and so forth. And then I was told to unlearn all that, because I needed to de-normalize everything for data warehouses. So I’ve literally worked for the last 20 years in data warehousing.</p><p>The last few years have all been cloud-based. I actually started my cloud journey with AWS and Redshift. That was, we’re going on for sort of 12, 13 years ago now. And then when Azure really started to motor with the products and services, I was then using tools like Azure SQL Data Warehouse.</p><p>That then moved into Synapse Analytics, and for the last couple of years it’s been Microsoft Fabric. So really it’s been predominantly the Microsoft data space that I’ve been working in.</p><p><strong>Shane</strong>: Excellent. ColdFusion and SQL Server in the early days. That was back in the days where we had to buy our own hardware and put it under our desk or in a cupboard, pre-data center. I remember back then, in my first job, we were doing some stuff and we were buying Compaq 386s, and the argument was do we get an SX or a DX?</p><p>And I can’t remember, I think it had four meg of memory, and it cost the organization $35,000 New Zealand dollars back then. Yeah, times have changed. Right now you can spin up those kinds of things by just putting your credit card in, or even getting your free tier, and you’ve got a massive amount of firepower in there.</p><p><strong>Andy</strong>: Yeah, and that’s the thing, it’s also a little bit disconcerting now because all that compute is sitting in the cloud. Certain vendors like to show compute in certain ways and show you cores, and other vendors like to obfuscate that behind other kinds of terminology, which, yeah, I guess Microsoft do.</p><p>So you’re trying to map the cloud compute with what you’re doing on premises, right? And saying, okay, I’ve got a certain amount of cores and I’m running this certain amount of workload on premises. What does that look like when I go to the cloud and I’m then dealing with a service that I can’t exactly a hundred percent map to those on-premises cores? But yeah, I totally get it.</p><p>You’re choosing hardware to run that software. You’re configuring that software as well. You’re configuring that software to death to get as much as you can out of that hardware.</p><p><strong>Shane</strong>: Yeah. And I think, like you, when I had my consulting company, when Redshift and AWS came in we jumped on board really fast. ’Cause in those days you had to buy a big million dollar Teradata box, or you had to buy some Oracle database and, again, rack and stack it with some leased hardware. And when Redshift turned up, that really was the first cloud database for analytics. And it’s interesting how it’s lost market share. It was first to market and it’s obviously been taken over by Snowflake and Databricks and a few others. So what’s interesting for me is we see these patterns getting repeated.</p><p>We see databases where we used to have to install them, and then we see databases as an appliance, where you buy the hardware and the database used to come with it pre-installed. Now we’re seeing cloud databases, Redshift being the first, and some more, oh, should I say, modern.
So ones that solve some of the problems, like I’m gonna have to vacuum your database, for anybody that’s dealt with a Redshift cluster before. And so one of the things that’s interesting is this idea of data modeling. Because I’ve been in the data space for 35 years and we’ve always modeled data, but every technology wave it seems to become hot and then cold. It’s yep, we model, and then no, we democratize.</p><p>And people can do all the work without any conscious data modeling. So what are you seeing in the modeling space and in your part of the world at the moment?</p><p><strong>Andy</strong>: I am seeing, and I’m gonna speak honestly and off the cuff here, I’m seeing a repeated and constant battle to make sure that data modeling is at the forefront of a data platform project. That it isn’t just about the technology and it isn’t just about the data layout patterns that we’ve seen.</p><p>One of my sort of bugbears is that I’ve seen a lot of articles on LinkedIn comparing architecture patterns with data modeling patterns, and I’m thinking, hang on, you are really comparing apples to oranges. The architecture enables the modeling. The modeling is put over the architecture. I can’t imagine a scenario where any CIO or CTO is gonna talk to members of their team and ask them, should we architect our data a certain way versus modeling it a certain way?</p><p>No. They are complementary. So in the last few weeks, I’ve been thinking about the data modeling side of things and asking myself, is it because it’s hard to do? Is it because it requires thought and time and collaboration, where sometimes technology is a little bit of an easier thing to do?</p><p>There’s still lots of human elements in using and deploying and working with technology, but I just feel that with data modeling, you really have to have that community grounding where people are working together to get what they need from the data into a shape that’s actually useful for their business. So yes, that was my framing on that.</p><p><strong>Shane</strong>: It’s an interesting lens, and for me, I’m with you, right? I talk about data architecture layers, so how we are laying out the architecture of our data across our platform. And I’m a great fan of layered architectures. I’ve used them for many years and I see massive value in them. And then once I have an idea of the layers, then I can talk about what data modeling patterns we want to use in each layer. And I’m a big fan of mixed model arts, which Joe Reis talks about. In my experience, I’ve never actually used a single data modeling pattern. Even back when I was dimensionally modeling, I always had a persistent staging area that was some form of native relational model that met the structure of the source system, before I moved it into the dimensional model. So if I think about that in terms of data architecture, and then I think in terms of data modeling, then I think in terms of technology, and I ask myself this question: if I was moving into the data domain for the first time, at the beginning of my career, where do I find any information around data modeling? Like, how do I learn? Because of the courses and the books that we had when we started, the books are still there, but the courses have disappeared to a degree. I think part of the problem is lack of accessibility to that content for some reason.
But the other thing that you just raised, and I’ve never thought about it this way, is as a technologist, if I want to learn a product, I give it a go, I read some of the documentation. Like, these days I’ll probably Perplexity it. And I get to have a go and I get immediate feedback. So I can install the software or turn it on in the cloud, and then I’m immediately able to log on, and then I can give it a whack. I can try to use it and it will give me feedback. If I’m doing that with data modeling, that’s not true. I can’t just turn something on. I can’t get instantaneous feedback that my model is good or bad or right or wrong or doing what I want. And yeah, it’s an interesting point you raised. Maybe that is the reason that people love to play with technology, because they can just learn it against that instantaneous feedback, and probably that adrenaline rush.</p><p>That first time you load the data up and your dashboard turns up with a pretty graph, you’re like, eh, that was pretty cool. That’s a bit of an endorphin hit. Whereas, I created a data model in Miro. Yeah, that’s a picture. It’s an interesting lens.</p><p>So do you agree that one of the problems is lack of access to content and ways of learning what data modeling is and how you use it?</p><p><strong>Andy</strong>: I think it always has to be a constant conversation with the data people around modeling, and it always has to be brought up, but it always has to be marketed and communicated that the technology is just the starting point. Modeling comes into it as well. You are delivering a product that is the amalgamation of the technology.</p><p>Of course, it is the process that the organization requires that data technology to do, and you’ve got there through modeling and landing the data in the right way for that organization. If we go back to resources, you touched on a point there around the books that we had access to and the resources that we had.</p><p>So yeah, I remember Steve Hoberman’s data modeling books. I then remember the Kimball Group books. I mean, I’m looking at the Kimball Reader on my shelf right now, and the dimensional modeling toolkit, because, yes, you required database technology to implement that model, but the technology could be from Microsoft, it could be from Oracle, MySQL, Postgres.</p><p>It didn’t need to be a specific vendor. You were applying a process to a technology. So unless you are guided towards these resources, and I was told, I remember being told in the early two thousands to go, I think it was a database administrator who told me to go and learn about normal form.</p><p>Database normalization, because the data had to be structured in a certain way to facilitate a transactional system, right? Database normalization. And then when I started to move into more of the data warehousing and analytics side of things, again, I was guided. I was told, it was a MicroStrategy consultant that I was shadowing so that I could understand the data warehouse that he was implementing.</p><p>Again, that consultant pointed me in the right direction. He pointed me towards books and said, I can teach you the basics of this now, but of course you are going to have to get hands on. You’re gonna have to learn this. And at some stage, I was told that I would have to learn these things and get hands on.
And I think that unless you&#8217;ve got those people now who are guiding people to do that, people have gotta be very proactive and they&#8217;ve gotta be going out there and they&#8217;ve gotta be saying, okay, I&#8217;m building a data platform. I&#8217;m using this technology.</p><p>What else do I need in terms of my knowledge that I need to apply to that as well? So there is a certain amount of guidance, and this is, I hope, being passed down by people. There are, great people out there like Johnny Winter who are talking about data modeling and, we&#8217;ll get onto the data modeling AI topic in a little bit that I.</p><p>Essentially took from Johnny, right? Johnny was talking about this and I looked at this and thought, wow, this is, yes, this is relevant. I like this. But yeah, so that&#8217;s what I feel. The other thing is, if you look at vendors, are vendors necessarily pointing people in the direction of the modeling processes?</p><p>I&#8217;m a Microsoft data developer, and throughout the years there have been various books around how to apply things like dimensional modeling, a modeling patterns to Microsoft products, right? And even now, you can go onto the Microsoft Fabric learn documentation, and in the warehouse, the fabric warehouse documentation, there&#8217;s resources around dimensional modeling.</p><p>They will tell you how you can do that.</p><p>So the vendors, like I said, have been providing some of this guidance and some of this documentation, but it&#8217;s very it&#8217;s small fry, right? It&#8217;s a few pages in their documentation on how to do it. So you&#8217;ve gotta, you&#8217;ve gotta be mindful of the fact that when you are coming at these data platforms, that you&#8217;ve got to be doing the modeling behind it as well.</p><p>And it has to be front and center to a project. Yeah, so that&#8217;s what I think about that.</p><p><strong>Shane</strong>: Yeah. And I agree with you that I, if I remember a lot of my learning was from that grumpy old DBA. When you used to deploy something and it ran like shit I remember the standard response from the grumpy old DBA is when you said, oh, my, my ETL loads are going too slow. They used to always reply with, compared to what, that was standard response. And then, eventually they go and help you tune it. They typically tell you it was a data model or your code that was wrong, not the database. And then, again, if I think back, I think you&#8217;re right that often the projects we are doing were expensive and expensive &#8216;cause the people and time. Therefore there was a whole hierarchy of mentoring because you couldn&#8217;t afford to have somebody just rush off and do something that didn&#8217;t fit the way everybody else was working. So there was always a mentoring process where somebody more senior experienced than you, took you on the journey to learn to do things the way they did. And I think what&#8217;s happened is as we&#8217;ve got new technologies, we&#8217;ve got new forms of democratization, and therefore people are able to do the work quicker and faster using those tools. And we&#8217;ve lost that mentoring process, that knowledge transfer outside the tool itself. And also people aren&#8217;t being taught what we got taught in the early days.</p><p>We did a podcast or a webinar and we&#8217;re talking about data layered architectures. And a template that I&#8217;ve been working on. And one of the questions in that was, is the data persisted in the layer or is it virtualized, right? 
Or is it temporal memory? And one of the questions we got online is, what is data persistence? And I sat back and went, holy shit. And one of the comments from Chrissy, who was doing the webinar with Ramona and me, was, we know what persistence is because we’ve worked with tools back when Excel didn’t save. And if you didn’t hit the little disk icon and persist that data down to your laptop and your laptop crashed, you’d lose it. Now we’ve got Google Sheets where you type, it saves, it persists the data. We don’t have to care about that. And so I think for some of us, we forget some of the core foundational concepts that we just learned by doing things, outside of data to a degree, that we’ve applied. So if we think about that, then, is the problem with data modeling right now a lack of tools? Really, there is still a lack of data modeling tools in our data stack. Or is it a lack of mentoring, and therefore is AI the answer? Do we actually now see AI bots or AI clones, or whatever we wanna call them, that actually become the data modeling mentors for people that don’t have a physical mentor like we did? What do you think?</p><p><strong>Andy</strong>: I think that the technology itself, and I’m talking about, it could be Microsoft technology, it could be Oracle, it could be Databricks, Snowflake, so on and so forth, they don’t have any guided ways of creating a model. I reckon the last time that I used a piece of software that guided me through the process, that was very aligned with a very specific data modeling technique, was Analysis Services Multidimensional, where objects inside this model were named after a specific modeling pattern. So your facts, your measures and your dimensions, if we’re talking about Kimball dimensional modeling. So that was really the last time that I used a piece of software that was quite aligned and essentially had a wizard that would take you through a modeling process.</p><p>’Cause of course now we’ve got different modeling processes. We could model, yes, third normal form. We could use dimensional modeling, we could model it Data Vault, and no vendor is locking themselves into a specific pattern, right? Vendors are saying, hey, you can bring this specific type of modeling to our software, so they’re not baking in a specific modeling process.</p><p>One of my real asks in terms of software is to start to bring in some of these modeling aspects. I would love functionality like slowly changing dimensions. So this is a dimensional modeling concept, which adds historical records to a reference table, to a dimension table.</p><p>And this is just my experience with the software that I’ve used. I really only see Databricks that have done this, slowly changing dimension Type 2, in their Delta Live Tables, or Lakeflow Declarative Pipelines. dbt, they’ve implemented slowly changing dimensions in something called snapshots.</p><p>So it is there, but it’s a little piecemeal, and it’s not necessarily being massively called out in terms of, this is the modeling pattern and this is the feature that we’ve implemented for you to be able to realize that modeling pattern. So then, you were talking about AI, right? And this is where we start to get into productivity.</p><p>This is where we start to get into how someone with not much experience of something can ask AI to help them.
And this is a classic case of where AI can help and start to move someone towards understanding how they work with the business, and how they would evolve a pattern as well. And like you were saying before, people can get hands on with technology, they can get cracking with technology.</p><p>It&#8217;s binary, something&#8217;s gonna work or it&#8217;s not. If you model some data, you might not know whether that model works that you&#8217;ve created six months, a year down the line, when all of a sudden something happens. That means that the model isn&#8217;t flexible enough. It hasn&#8217;t been thought of, there hasn&#8217;t been enough collaboration to understand the impact.</p><p>I went to the Fabric London user group. I was talking at the user group, but I was talking on technology, right? I was talking about a specific feature within Fabric called Materialized Lake Views. A little segue from our conversation, but it&#8217;s a feature. It&#8217;s a feature in a piece of software from a vendor.</p><p>Johnny Winter was there, and he was talking about data modeling. He was talking about Sunbeam modeling. And this is a modeling practice that I&#8217;ve used in the past only because Johnny Winter has been talking about it, because it brings together a couple of areas of data modeling that I have used.</p><p>I just never thought there was this portmanteau of these things. BEAM, business event analysis and modeling, from Lawrence Corr and Jim Stagnitto&#8217;s great Agile Data Warehouse Design, and Mark Whitehorn, which I now know is the pronunciation, who was championing sun modeling, in terms of a central event, and then your sunbeams were your reference data in terms of how you brought context to that data.</p><p>And of course that was just the modeling aspect. But Johnny then started to show a tool called Ellie AI and that really got me thinking. So this is a data modeling tool. It&#8217;s focused on data modeling. It&#8217;s focused on understanding how to build enterprise data warehouses, how to help someone and guide them through the process of creating a data model.</p><p>And it&#8217;s very different from something like a data warehouse automation tool, right? So there&#8217;s data warehouse automation tools out there, WhereScape and such, in which you pretty much have to know the modeling pattern while you are using these tools. Whereas all of a sudden I&#8217;m looking at Ellie AI and thinking, oh, okay, I get this now.</p><p>It is essentially an LLM, or maybe multiple LLMs, that&#8217;s been trained on data modeling, perhaps on architecture, layout patterns, technology even, &#8216;cause it can then help shape how it helps the user as they prompt their way to a data model. And at first I looked at it and thought it&#8217;s just another AI tool.</p><p>There&#8217;s millions of AI tools out there now. But I did think, hang on, if this is an AI tool that&#8217;s helping people data model, this can only be a good thing. This could only be of benefit to people, to use something like this to help them through the process of modeling. They&#8217;ve got the technology, they can use Copilot or ChatGPT to help them write code.</p><p>But they&#8217;ve got these tools with the subject matter expertise of data modeling to help them and guide them through getting to a state where they might not have these problems in 6, 9, 12 months time of a model that isn&#8217;t flexible enough, because they can prompt their way into flexibility.
So that was what I was looking at specifically with the AI tools.</p><p>Shane, so I&#8217;m curious to think about what you think about somebody who doesn&#8217;t have much domain expertise in data modeling, but is working in the data space using these AI tools for modeling.</p><p><strong>Shane</strong>: There&#8217;s a lot to unpack there. I&#8217;ve just made a whole page of notes. So let me go through from the beginning of what you talked about and replay it back and my thoughts around it. I liked your point around opinionated tools. If I think back to previous waves of technology for the data space, the tool really supported one modeling pattern. Yeah. And realistically it was Kimball. Kimball was the number one data modeling pattern that I ever saw in data warehousing in the old days. And that was because Kimball wrote good books. He shared his content online with his blog, which was free and easy to access, and he ran great courses. So getting access to how to model, he was the most accessible piece of content that you could find, and it made sense. I&#8217;m a big data vault fan. I like the physical data vault modeling pattern. I&#8217;m not a fan of the data vault BI methodology, but I find that content and access to how to model using the data vault pattern of hubs, sats and links is incredibly hard to find.</p><p>It is poorly written. It is paywalled. And so I think that&#8217;s why we see, with the advent of DBT and when people started realizing they had to model data consciously, we saw Kimball take off again, right? Because the old content is still valid. Just on that though, I was really intrigued about your SCD two comment. So let&#8217;s take a segue on that and I&#8217;ll come back to the AI stuff in a minute. Because I remember with the original ETL tools that we were using, there was never a native SCD type 2 node. We used to always have to bloody well write that node ourselves. Then when the cloud analytics databases came out, it would&#8217;ve just made sense to me for SCD type 2 behavior, that pattern of historical recording of change data, to be a database feature. Not a piece of code to detect the change and store it, but just make it a feature in the database to say, this table is an SCD type 2 table, because the database could take care of that change detection, the end dating or the flagging of a current record. And so it&#8217;s really interesting that SCD type 2 is one of the patterns that you use for dimensional modeling all the time when you&#8217;re physically modeling using that pattern, but somehow we still seem to be lumbered with writing some code that deals with it. Is that your experience? Are you still seeing people having to write code to implement the type two pattern?</p><p><strong>Andy</strong>: Yes, I am basically, and this was even talked about a few days ago. So I was at a little conference in Birmingham called FabFest, which was focused on Microsoft Fabric. I was speaking about a function called, or a feature called, Materialized Lake Views, which is essentially like Databricks&#8217; Delta Live Tables, or Lakeflow Declarative Pipelines as they&#8217;re called now.</p><p>&#8216;Cause they&#8217;re abstracting it away from Delta, because it&#8217;s not just Delta that this technology supports, but also DBT. And someone said, why have we not got this SCD out-of-the-box functionality? Because it&#8217;s almost like slowly changing dimensions have transcended the specific methodology in which they would be used.
And what I mean by that is when you started learning Kimball and you started learning dimensional modeling, one of the sub-categories. Was dimensions and one of the sub subcategories was slowly changing dimensions and all of these different types.</p><p>So type two has probably now become the champion, right? It&#8217;s the one that will track changes over time by adding new rows of data and the associated metadata to keep it. Yes, there are, type four and type six and type three, and all of those have their reasons to implement them. Type three, you&#8217;ve got multiple columns which can store, the current data and then the previous amounts of data, but they are a little bit more difficult to implement.</p><p>So most of the time people will use slowly changing dimension type two, because that&#8217;s the one that is the most. Relevant, the most prevalent and the ones that&#8217;s easy to implement. However, unless you are using some tools that have this functionality in built, like I said, DBT have functionality called snapshots, Databricks in their, lake flow declarative pipelines.</p><p>You can declare an SCD type. It&#8217;s not massively prevalent and soaked in to the data landscape. So this person I remember at this conference was saying I&#8217;m still having to write my code to implement my slowly changing dimension. I&#8217;m still having to say these are the columns that I would like to track.</p><p>This is the key that I would like to join on. These are my columns that I&#8217;ve defined for my metadata, my from my two, my is active. If you have that, any other metadata columns that you want to be able to track your changes. Then, and my last point on this is we look at the medallion architecture right now, I&#8217;m not gonna go into the ins and outs of, should we call it the medallion architecture, because to my mind it&#8217;s a data layout Pattern.</p><p>It&#8217;s not necessarily an architecture, but it has these different zones of data. We know raw is bronze and silver is cleansed, and gold is modeled, but silver. Which is the cleansed data. It hasn&#8217;t yet got to a stage in which it&#8217;s been modeled to a specific Pattern. There&#8217;s a lot of advocates that want to apply slowly changing dimension functionality to that silver data, but it&#8217;s not modeled yet.</p><p>It&#8217;s existing at the same granularity as the raw data. Obviously it&#8217;s gone through dedupe, it&#8217;s gone through cleansing and all that kind of stuff, but it&#8217;s not modeled yet, but we&#8217;re applying something that was a modeling Pattern or a modeling feature into this store of data. So I find that quite interesting as well, is that this SCD has almost been extracted as a feature of a modeling Pattern and can just now be used as a way of tracking changes over time.</p><p>But To your point, I just don&#8217;t see enough of that functionality automatically added to database and data products. People are still having to do it themselves.</p><p><strong>Shane</strong>: And that affects adoption because if we say that a simple technical Pattern of type two implementation, where we know what the patterns are, right? We know that is detect change, insert row, start date, end date. If that , what you like is active or as current as a flag. We could actually just add all of those, and I remember in the early days, when we were constrained on database technology where the cost of those servers was expensive, the cost of the licenses were horrendous. We had to optimize to reduce costs. 
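<p>To make the type 2 pattern just described concrete, here is a minimal Python sketch of the mechanics: detect a change on the tracked columns, end-date the old row, and insert a new active row with from/to dates and an is-active flag. The table and column names are invented for the example, and a real implementation would normally use the warehouse&#8217;s merge capability or a tool feature such as DBT snapshots rather than hand-rolled code.</p><pre><code>from datetime import date

# Minimal SCD type 2 sketch. Dimension rows carry from_date, to_date and is_active;
# incoming source rows are compared on the tracked columns and, on change, the old
# row is end-dated and a new active row is inserted. Names are illustrative only.
TRACKED = ["name", "city"]   # columns we detect change on
KEY = "customer_id"          # business key we join on

def apply_scd2(dimension: list[dict], incoming: list[dict], load_date: date) -> list[dict]:
    current = {row[KEY]: row for row in dimension if row["is_active"]}
    for src in incoming:
        existing = current.get(src[KEY])
        if existing and all(existing[c] == src[c] for c in TRACKED):
            continue                          # no change detected, keep the current row
        if existing:                          # change detected: end-date the old row
            existing["to_date"] = load_date
            existing["is_active"] = False
        dimension.append({                    # insert the new active row
            KEY: src[KEY],
            **{c: src[c] for c in TRACKED},
            "from_date": load_date,
            "to_date": None,
            "is_active": True,
        })
    return dimension
</code></pre><p>The point of the conversation stands either way: this is boilerplate that could be a declarative feature of the database or platform rather than code every team writes again.</p>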
You probably remember it, we would argue type one versus type two. And we would type one by default, and we would type two where we knew there was value, because the cost of type two was higher than type one. Now with cloud analytics databases, we really don&#8217;t give a shit, we just type two everything because we can. And it saves us problems later. And people are more expensive than those databases. And so we can be lazy to a degree, but there&#8217;s value in being lazy. What&#8217;s interesting for me is that pattern is a very well known pattern, yet it hasn&#8217;t become opinionated in tools. And when we talk about data modeling patterns, do we go snowflake versus star versus data vault versus anchor versus hook versus unified star schema? There&#8217;s all these other patterns where actually they&#8217;re a little bit harder to be opinionated about, and therefore they&#8217;re harder to bake into a tool. So if we can&#8217;t do the simple stuff, how do we expect to do the hard stuff? And then the last thing I&#8217;ll say before we move on to the next part of your point that you had earlier is, your silver&#8217;s not mine. I think medallion has been great because it&#8217;s reinvigorated the conversation around layered data architectures and the value of them. But when you talk about your silver as being cleansed, I talk about my silver as designed, and I model. We have an opinionated data modeling pattern, which is concepts, details, and events.</p><p>That&#8217;s how we model in what we call our silver. Our raw is historicized; we are effectively applying an SCD type 2 pattern on our raw data for a whole lot of reasons. So again, the problem with medallion is, it&#8217;s a nice way of describing a layered architecture. But as soon as we get into any detail, what you&#8217;ve got in your layer is not what I&#8217;ve got in my layer, and that&#8217;s okay.</p><p>As long as you tell me that you are cleansing in silver and it matches the structure of your source, and you tell me you model in gold, I get it. Now I get your architecture by you just using those words, that opinion you have applied. So I think that&#8217;s the key, it&#8217;s no more, there&#8217;s only one way to medallion; it&#8217;s, what do you mean by silver? So let&#8217;s jump on then, and let&#8217;s talk about tools. Because in the past we had tools like Erwin, oh God, Sparx EA. We had some really hard to use data modeling tools to draw diagrams of what our models look like, and typically they were completely disconnected. We found it really hard to draw a diagram for our conceptual and physical model and then get that model instantiated in our database easily. And the modern data stack, in the previous wave, now that it&#8217;s dead, we saw tools like Ellie and SqlDBM, we saw visual modeling tools come out. But what was interesting is they became a category, they were part of a stack.</p><p>So if you had a modern data stack, you&#8217;d end up with five to 10 different tools to do your end-to-end processing. And we&#8217;ve seen a lot of consolidation in the market. We&#8217;ve seen a lot of those tools that are part of the stack disappear or get acquired or become features in one of the other tools. We haven&#8217;t seen that for data modeling. We haven&#8217;t seen the data modeling capability being brought back into those end-to-end stacks.
And again I don&#8217;t do a lot of work in Fabric, but I don&#8217;t think, apart from the Power BI and SQL Server Analysis Services part of the Microsoft stack, they&#8217;ve ever really had a modeling tool, have they?</p><p>There&#8217;s no tool I would go into that would help me create a conceptual or a physical data model and instantiate the physical model in a database within the Microsoft stack, or is there?</p><p><strong>Andy</strong>: So this is probably one of the most asked questions in forums around the data modeling aspect, because as you&#8217;ve said, we&#8217;ve had tools over the years that have enabled us to do data modeling. Even SQL Server Management Studio, which you can download for free, has got this almost live data modeling process attached to it, where it&#8217;s very much a physical design pattern. You can&#8217;t logically design something in its interface. In Fabric, we&#8217;re still there. We haven&#8217;t got anything that&#8217;s gonna help us logically design a data model outta the box. There&#8217;s nothing that we can start with and say, okay, I wanna start with the conceptual data model.</p><p>I wanna go down into a logical design, and then finally a physical design of my lakehouse, of my warehouse. We&#8217;re having to still use other tools to be able to do that. So whether people still use Visio, whether they&#8217;re using other things, AATE or SQL Database Modeler, I think even Toad, for these kinds of modeling tools. One of the things that I didn&#8217;t like is when Microsoft did deprecate some of the data modeling tools that were available within Visual Studio, I just thought, you&#8217;ve got to have something that can help a process. When I look at Fabric, or even Databricks, Snowflake, all of these sorts of cloud vendors, they are the canvas, they are the paintbrush, but they don&#8217;t help you draw the picture.</p><p>You need a process to go and help you draw that picture. And there are cloud tools that help you do the data modeling. Some of them, I think a lot of them, are paid, because to go from conceptual to logical, then physical, is a natural progression. People want to be able to create the physical data models.</p><p>Ultimately, yes, some people might be a little bit annoyed that they&#8217;re forced to go through the conceptual and the logical modeling processes before they get down to the physical. But then they want to be able to click a button and it generates the code necessary for them to run and create that physical model.</p><p>No, I don&#8217;t see anything. And if we&#8217;re talking about Microsoft Fabric specifically, I don&#8217;t see anything in Microsoft Fabric that&#8217;s gonna help you from the conceptual all the way down into the physical. </p><p><strong>Shane</strong>: And again, you&#8217;ve gotta be... software should be opinionated. So the way we do it is, I create the conceptual model and then it generates a physical model without me doing anything, because our physical model is opinionated. So our conceptual model is open in terms of the things you create, the concepts, the who-does-whats, &#8216;cause I&#8217;m a great BEAM fan as well.</p><p>Lawrence Corr&#8217;s book is one of the ones I read early. When I had my consulting company in New Zealand, pre COVID, before things went online, we used to fly him over as often as we could to teach our customers how to BEAM and event model their data,</p><p>&#8216;cause it was so useful.
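<p>As a rough illustration of that conceptual-to-physical generation Shane describes, here is a toy Python sketch that turns a small conceptual model into physical tables under a simple concepts / details / events pattern. The naming rules and columns are invented for the example and are not AgileData&#8217;s actual implementation.</p><pre><code># Toy sketch: generate an opinionated physical model from a conceptual model,
# assuming a simple concepts / details / events pattern. Purely illustrative.
def generate_physical_model(concepts: list[str], events: list[tuple[str, str, str]]) -> list[str]:
    ddl = []
    for concept in concepts:
        # one concept table and one detail table per concept
        ddl.append(f"CREATE TABLE concept_{concept} (concept_key STRING, business_key STRING);")
        ddl.append(f"CREATE TABLE detail_{concept} (concept_key STRING, attribute STRING, value STRING, from_date DATE);")
    for subject, verb, obj in events:
        # one event table per "who does what" business event
        ddl.append(
            f"CREATE TABLE event_{subject}_{verb}_{obj} "
            f"(concept_{subject}_key STRING, concept_{obj}_key STRING, event_date DATE);"
        )
    return ddl

if __name__ == "__main__":
    for statement in generate_physical_model(["customer", "product"], [("customer", "orders", "product")]):
        print(statement)
</code></pre>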
And that idea of your conceptual model of who does what, your core concepts, depends on the industry, depends on the business case, the usage, the actions and outcomes you want to take. So it is a little bit less opinionated to a degree, but once you&#8217;ve got that sorted, you can make your physical modeling pattern incredibly opinionated. And that&#8217;s why I think these disconnected data modeling tools are struggling, from what I can tell. And you&#8217;re starting to see them now bring in the ability to actually generate the ETL, the code to deploy the model and load the model.</p><p>Because that&#8217;s the space you have to be in. You have to be able to create the model and test it, that it has value. And so when I come back to the AI tools, I&#8217;ve played a lot with the LLMs, played a lot with the bots, and I can see, as a helper friend, as an assistant, it allows me to ask questions and provide some context around the industry, the use case, and get back a starter for ten model. And that&#8217;s really useful. But if I think back to this idea of the mentoring I had earlier in my career by those grumpy data modelers, what they used to do was they used to stress test the model. There was something magical about the way you could give them a data model, they could just look at it, and then they could call bollocks on the things you got wrong.</p><p>It was just that innate pattern matching in their heads, &#8216;cause they&#8217;d done it so often and they could go, yeah, that relationship&#8217;s not a one to many, it&#8217;s a many to many, it&#8217;s not gonna survive. Okay, the rate of change on that table is gonna be horrendous, you&#8217;re gonna go blow out your Oracle instance. Yeah, or you have to upgrade to Oracle RAC, which will cost you two legs, one arm and your three newborn children. I wonder if that actually is where AI tools have to take us, effectively an agent model, right? Where there is one that helps us build the model and one that helps us stress test the model. Which goes back to that point you made a long time ago, which was, when we work with technology, we get instantaneous feedback that the technology&#8217;s working or it&#8217;s not. When we work with data modeling, we don&#8217;t. So maybe that&#8217;s where the AI tools need to take us, one to help us create it, and then another agent, which is the grumpy old data modeler that tells you where you got it wrong. What do you think?</p><p><strong>Andy</strong>: So I think AI can help all the way through that process. And this is where we start to look at AI less as a technical tool that will help us do something and more as a sounding board, as something that we can ask it to be quite antagonistic with. As you said, and like I raised earlier about AI tools helping with data modeling, you could go to an AI tool and say, someone has told me to design an X, Y, Z system, for argument&#8217;s sake.</p><p>Let&#8217;s say it&#8217;s a data warehouse or it&#8217;s a lakehouse and we&#8217;re gonna use a data modeling technique that is best for reporting and analytics. And the LLM might reply and say, okay, well here are a few modeling patterns, and dimensional modeling is the one that you would use for a data warehouse, and so on and so forth.</p><p>So let&#8217;s say they then pick that and carry on doing that, it&#8217;ll then ask you questions about, I&#8217;ve got all these different entities in my source system. How will a dimensional model help me?
So it&#8217;ll then work through those entities and say, okay, you&#8217;ve got these entities that look like they can be grouped together.</p><p>Perhaps that&#8217;s a dimension and it&#8217;ll work through the process with you. So let&#8217;s say you&#8217;ve got that sorted, so you&#8217;ve applied your critical thinking and not just accepted everything that the LLM has given you. You&#8217;ve gone back and dah. Maybe you&#8217;ve Googled, maybe you&#8217;ve asked other people that are experts in that area to say, okay, you know what?</p><p>I&#8217;ve spent a whole day generating this model. It would&#8217;ve taken me two weeks if I had to learn the theory and then do it. What do you think? I&#8217;m sense checking it. Someone might come back and say, okay, we can tweak a couple of things, but actually that looks pretty good. Great. So we&#8217;ve got our starting point then.</p><p>You talked about stress testing and Yes. So this is where you can then ask the LLM to be antagonistic and say so this is my data model we&#8217;ve got here because the business want to be able to report on these X number of attributes and this is how they want to measure it. But I know that there are other source systems and things like that.</p><p>Can you tell me what problems this model might have in the future? And of course you essentially, you are asking the AI to try and do some future proofing for you and some troubleshooting. And it might come back with some generic questions about perhaps your product dimension isn&#8217;t, deep enough.</p><p>What about other entities that you might need to bring in that you are, that you haven&#8217;t yet got links in your fact tables, but it&#8217;s gonna surface, it&#8217;s gonna help surface potential problems for you to then deal with. So I totally get that as well. And then the third point that I wanna add is about anticipating changes.</p><p>We touched on, Lawrence Core and, the Agile data warehouse book a little bit earlier, which Yes is, something that I go back to constantly and, we&#8217;ve got beam in there to help us do this. But the agile data warehouse is also there to help us iterate over a model as well.</p><p>So perhaps we then say to the LLM, look this first version of the model, okay, we&#8217;ve gotta set it in stone now because we&#8217;ve got project deadlines we need to get the data in because the business are gonna build X amount of reports. And we tried to make the model as generic as possible.</p><p>&#8216;cause we don&#8217;t want a report driven model. We want the data driven model in here, but help us understand what we might need to do to modify the model and be a little bit agile. So I would say that the usage of the AI tools to help us generate the model is just the first part. Most of it is gonna be about asking the model to be quite antagonistic about the model it&#8217;s generated.</p><p>Yeah I think that it&#8217;ll help us build the model, but the most important thing is antagonize the model, test it hopefully point people in the right direction in terms of future proofing and certainly surface issues that might happen in the future. They might need to fix those issues. I don&#8217;t think I&#8217;ve worked with a dimensional model yet that doesn&#8217;t incur a certain amount of technical debt in how it&#8217;s implemented.</p><p>But if you can mitigate those things earlier on, that&#8217;s just gonna be of benefit. 
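<p>As a hedged sketch of that stress-testing step, the prompt below packages a draft model, the business questions it must answer, and the changes you already know are coming, and asks the LLM to play the grumpy reviewer. The function name and wording are illustrative, not any particular tool&#8217;s API.</p><pre><code># Sketch of the "antagonise the model" step: build a review prompt from the draft
# model and the known business questions, then send it to whatever chat client you use.
def build_stress_test_prompt(model_ddl: str, business_questions: list[str], known_changes: list[str]) -> str:
    questions = "\n".join(f"- {q}" for q in business_questions)
    changes = "\n".join(f"- {c}" for c in known_changes)
    return (
        "You are a grumpy, experienced data modeller. Be antagonistic.\n"
        "Review this dimensional model and list the ways it will break or need rework:\n\n"
        f"{model_ddl}\n\n"
        f"It must answer these business questions:\n{questions}\n\n"
        f"We already know these changes are coming:\n{changes}\n\n"
        "Call out missing entities, wrong grain, relationships that are really many-to-many, "
        "and dimensions that are not deep enough. Rank the issues by cost of change."
    )
</code></pre>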
Yeah, so that&#8217;s my thoughts on that, Shane.</p><p><strong>Shane</strong>: Yeah, and that&#8217;s that problem between doing a model quickly that gets value now and trying to boil the</p><p><strong>Andy</strong>: Yeah.</p><p><strong>Shane</strong>: ocean for all changes in the future. Or an enterprise data model. Again, going back to that anti-pattern we have these days of a data modeler sitting in a room for two years doing one enterprise data model to rule them all that nobody implements. One of the things we&#8217;re doing is we&#8217;ve been experimenting in this space, and we have a bunch of partners use our platform and we got one of them to experiment. And we had a use case around Google Ads, so the partner needed to bring Google Ads data in and deliver it for a customer. And so that first part, the opinionated agent that&#8217;s gonna help you do the initial model, was really interesting. What we found was, first thing was, we were lucky that Google Ads gave us effectively context about the tables. It brought in metadata that described the tables quite well. So that was really valuable for the agent, because it now had a bunch of hints of what the source data looked like. The next thing was, we have an opinionated design pattern for the way we model.</p><p>And so our agent already knew about that, so it knew what the rules of the game were, it knew it couldn&#8217;t dimensionally model it, it knew it couldn&#8217;t anchor it, it knew all the patterns it couldn&#8217;t use. It knew what our pattern looked like. So that opinion was effectively already in the agent, as a bunch of rules. We also gave it the information product canvas that the partner had done, which is a description of the actions and outcomes and business questions that need to be answered first. And what that gave was a boundary to the agent to say, don&#8217;t model all the Google Ads data, only model the data that&#8217;s gonna support this outcome. So again, it gave it a boundary that let it model lightly. Now, what we didn&#8217;t do was the bit that you are just raising, which is then stress test change, right? We didn&#8217;t say what&#8217;s gonna happen next, we deal with that in different ways at the moment. But I&#8217;m gonna think about that one. Really, that boundary of opinion that was given to the modeling agent meant it did a good job. If I had just said to it, I&#8217;ve got Google Ads data, model it, I have a theory and I&#8217;m gonna go test this. I reckon every time I ask it that, even though it&#8217;s non-deterministic, it&#8217;s gonna Kimball model it. The reason I say that is, if you think about why has Kimball and dimensional modeling become the number one modeling technique for DBT? Because if you are an analyst and you are moving into the engineering space and you hear that you need to model some stuff, and you use an LLM or you go Google search, you&#8217;re gonna come back with dimensional modeling every time. &#8216;Cause as I said, it&#8217;s the most freely available, well described content in the world for data modeling in an analytics space, in my opinion. So therefore the LLMs, which trained on everybody else&#8217;s content without paying for it, but we won&#8217;t go into that one, it&#8217;s gonna have the richest piece of content in the LLM for dimensional modeling.
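<p>A small sketch of the kind of boundary Shane describes for the Google Ads experiment: the source metadata becomes hints, the opinionated modeling pattern becomes rules, and the information product canvas becomes the scope. The structure and field names here are hypothetical, not the actual AgileData agent.</p><pre><code># Sketch of constraining a modelling agent with three inputs: source metadata as hints,
# an opinionated modelling pattern as rules, and an information product canvas as the boundary.
def build_agent_context(source_metadata: dict, pattern_rules: str, canvas: dict) -> dict:
    return {
        "system": (
            "You are a data modelling agent. Model ONLY what is needed to answer the "
            "business questions in scope, and ONLY using the modelling pattern in the rules. "
            "Do not use dimensional, data vault or anchor modelling."
        ),
        "pattern_rules": pattern_rules,       # e.g. a concepts / details / events pattern
        "source_hints": source_metadata,      # table and column descriptions from the source
        "scope": {
            "business_questions": canvas.get("business_questions", []),
            "actions_and_outcomes": canvas.get("outcomes", []),
        },
    }
</code></pre>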
The interesting question on that one is actually, if I go and ask an LLM to model it with no constraint, no opinion, I bet it&#8217;ll come back with Kimball modeling.</p><p>What do you think?</p><p><strong>Andy</strong>: And I think the LLM would probably not be displaying any emotion. So let me expand on that. So you touched on data vault earlier and you say that&#8217;s a data modeling pattern that you like. If we go back several years, actually decades, we&#8217;re talking about Kimball versus Inmon.</p><p>We were talking about dimensional modeling versus third normal form. Then Dan Linstedt comes along and we&#8217;ve got data vaults. There was lots of emotion involved in people comparing these technologies. In fact, all of those people, Bill Inmon, Dan Linstedt, Ralph Kimball, they all said, and it&#8217;s in their books, that these modeling patterns are complementary.</p><p>Bill Inmon would say your enterprise data warehouse can be third normal form, but for reporting and for feeding into analytical tools, the Kimball model, the dimensional model, is great. Kimball would say, ah, okay, you can do that, but yeah, you can also do your enterprise data warehouse in dimensional modeling.</p><p>Okay, there might be a little bit of difference of opinion there, but they were still complementary. Even data vault, you can&#8217;t just plug straight into analytics and tools, and I will admit and agree that it&#8217;s great for tracking changes over time at the lowest level of granularity, giving you the most flexibility to do what you want with it afterwards.</p><p>But a dimensional model is very good for plugging them in. But of course, we see all those debates out there, this versus that, this versus that. The LLM will take those arguments into consideration, but it has no bias. It has no emotion attached to any of those modeling patterns. So I would say that the LLM will probably come out with dimensional modeling, because it&#8217;ll reason that, okay, you can store your data this way, but it&#8217;s not going to be what you need to design the model that&#8217;s going to be delivered to the business.</p><p>And I would like to test this as well, and antagonize an LLM around these modeling patterns and ask it, okay, I&#8217;m gonna be designing this, what do you think the best model is gonna be? And then I might add in a few trip wires to it and say, what about data vault? And what about this?</p><p>And I&#8217;m hoping that the LLM would say, yes, you can use those modeling patterns, but it&#8217;s generally agreed that they are complementary and that you can add on a dimensional modeling pattern to a third normal form or a data vault. But I am hoping that the AI has less, shall we say, emotion attached to picking that data modeling pattern.</p><p><strong>Shane</strong>: If it&#8217;s got no emotion, we should ask it for a definition of data product, semantic layer </p><p><strong>Andy</strong>: Yeah.</p><p><strong>Shane</strong>: We&#8217;d love to argue. I&#8217;m gonna disagree with you on that one. And the reason is, I don&#8217;t think LLMs are reasoning, they are pattern matching and tokenization based on a bunch of content. My hypothesis, and it is just a hypothesis, is that the Kimball and dimensional content is far more widely available, and therefore the model has a bias towards using it. However, it&#8217;s just a hypothesis.
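<p>That hypothesis is cheap to test. A rough sketch of the experiment: give several LLMs the same unconstrained modeling request a number of times and tally which pattern each one reaches for. <code>ask_llm</code> is a placeholder for whatever chat client you would use, and the keyword matching is deliberately crude.</p><pre><code>from collections import Counter

# Sketch of the bias test: same unconstrained prompt to several models, tally the
# modelling pattern each answer leans on. ask_llm is a caller-supplied function.
PROMPT = "I have retail sales data in a source system. Design me an analytics data model."
PATTERNS = ["dimensional", "star schema", "data vault", "third normal form", "one big table"]

def detect_pattern(answer: str) -> str:
    lowered = answer.lower()
    for pattern in PATTERNS:
        if pattern in lowered:
            return pattern
    return "unclear"

def run_bias_test(ask_llm, models: list[str], runs: int = 5) -> Counter:
    tally = Counter()
    for model in models:
        for _ in range(runs):      # repeat, because responses are non-deterministic
            tally[(model, detect_pattern(ask_llm(model, PROMPT)))] += 1
    return tally
</code></pre>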
And one of the things that Joe Reis has been doing as part of the practical data modeling community is just testing stuff live with a bunch of people. I&#8217;m gonna suggest you and I do that. I suggest that we figure out how to do a live session, bring up multiple LLMs, and with a bunch of other people watching or helping us, we just bash the snot out of it and try and see, actually, is there a bias for a modeling technique? But if I take that away for now, your comment around Sunbeam really intrigued me, because Johnny&#8217;s mentioned it before and I struggled to find any content around it. It&#8217;s one of those data modeling patterns that is actually quite hard to find anything about. And good point, I need to get somebody on the podcast to come and explain it. So if I wanted an LLM to assist me in designing a Sunbeam model, because I&#8217;m opinionated that&#8217;s the model I prefer for whatever reason, I think it&#8217;s gonna struggle, and that will be another test, maybe we can give it a bash on that to see. So I think if you&#8217;re using an LLM to assist you in modeling and you are using a well-known pattern that you are opinionated about, so data vault, dimensional, third normal form potentially, you probably don&#8217;t need to use a lot of reinforcement with the LLM, because it knows what you&#8217;re talking about.</p><p>If you wanted to bring in some of the more obscure modeling patterns, like Sunbeam, I think you&#8217;re gonna actually have to pass it a reinforcement of that content. So one of the interesting things about Joe recently, what he does with the practical data modeling community, is he does live sessions where we all jump on screen share and we actually test out a hypothesis. And so one of the ones that he did was this idea of, could we start with a business problem in an industry none of us knew, get the LLM to help us understand the industry, create a conceptual model, move it to a logical model, move it to a physical model, and actually implement that physical model in a database. And that was fun. But what was interesting for me was observing the way Joe approached it, and he comes from more of a data science background than I do. So he tended, in my view, from what I saw, to approach it from an EDA, an exploratory data analysis approach. So he would be looking at the data sources and trying to understand the data that&#8217;s coming in, because that&#8217;s how he thought. Whereas for me, I come from more of a business background. That&#8217;s how I&#8217;ve been trained.</p><p>So more of that who does what, business process stuff, a lot of the BEAM, first part of the book. And so for me, I typically wanna understand the who does what, the core business events, the core concepts, and that&#8217;s how I model. So again, I think in terms of an AI assistant that helps you model, you probably want to train it around your modeling process.</p><p>Or maybe you don&#8217;t actually, maybe the LLM should decide what inputs it needs, rather than you being opinionated on how you typically model. What do you think? First of all, which way do you model? Do you go source specific and understand the data first, or do you try and understand the core business events?</p><p>And then do you think you should be opinionated to the LLM about which way it should approach it?</p><p><strong>Andy</strong>: I&#8217;ve always modeled from the source system side of things and then taken into consideration the requirements from the business.
And I know why that is, and that&#8217;s because I am fundamentally a technical person. So I&#8217;ll always want to default to looking at the technical aspects of things. Hey, I can look at source systems, I can understand schemas, I can understand tables and columns and the domain values within that, and then I can look at matching that to what the business wants.</p><p>And I&#8217;ve worked with people that work the other way around. They are interested in the art of the possible. They&#8217;re interested in what do the business need to make the decisions, and then let&#8217;s go and find what we need from those source systems. So I&#8217;ve moved my needle a little bit more towards the art of the possible, right?</p><p>This is what the business ultimately needs in terms of decision making. If they need to be more operationally efficient, if they want to, take advantage of opportunities in the market, challenge competitors, whatever, I then go and look at those source systems and see what&#8217;s available. And then sometimes you really can&#8217;t realize.</p><p>Some of that data that is needed for the business, right? Because it&#8217;s just out of the hands of those source systems. And I suppose you could calculate it and you could infer it from that data, but I&#8217;ve tended to work that way. And then interestingly, when you were, talking about AI and Joe working through the kind of modeling aspects from that perspective, yes, I can totally see someone asking the model to work with them in a specific way and say, right here is my source system.</p><p>This is what I need to do. Now when those source systems, and if you&#8217;ve got access to schemas and you can provide that schema to an LLM and let&#8217;s, let&#8217;s say it&#8217;s An LLM that&#8217;s secure, it&#8217;s within your organization or it&#8217;s within your data boundaries. So you can give the LLM that schema and then you ask it to model that for you.</p><p>You may then ask that LLM I need to join it with other systems as well, and this is the system and this is the schema. So it could then help you generate the model to incorporate multiple systems or showing you some examples about how you can, join those systems together.</p><p>What I wouldn&#8217;t want an LLM to do is just go crazy with the business requirements and say, okay, this business operates in. This domain. So this is all of the data that they need to be ultra competitive and at the top of their game. And then I&#8217;ve gotta scrabble around trying to desperately find where I&#8217;m going to get this data from.</p><p>That the model says is going to help me build the perfect data model for the business. I guess I&#8217;m a data modeling pragmatist. I would look at the source systems and I would look at how they support the business in what they want to do. And then I would probably work with the LLM in that fashion.</p><p>I&#8217;d be saying okay, here&#8217;s the framing. Here&#8217;s the context. This is what we need to measure, but this is what we&#8217;ve got in our source systems, and this is the hard facts about what we&#8217;ve got in the source systems. Help me build that model that can, realize what the business wants, but work within the limits of what I&#8217;ve got with those source systems.</p><p>So that&#8217;s what I would do.</p><p><strong>Shane</strong>: I was just thinking then, back to my point about, enterprise data model is sitting in the cupboard for two years to come out with the most beautiful data model ever. 
That&#8217;s basically a person going into an LLM now and saying, you have no constraints, here&#8217;s the industry, gimme a data model from scratch that does everything.</p><p>It&#8217;s quicker, it&#8217;s not two years, but it&#8217;s just as unimplementable, if that&#8217;s a word. One of the things we found was, if we think about ChatGPT-5 and this idea that we didn&#8217;t get the AGI that we expected, but what we got was a better interface. So instead of having to decide what type of LLM foundation model you wanted to use, you tell it what you want to achieve and it works out which model is the best fit for you. And one of the things we found when we were experimenting was, if we had one agent, our agent&#8217;s called ADI, and we asked her to do everything, she was okay at it, but she wasn&#8217;t great. When we broke her out into sub-agents, so we had ADI the data modeler, ADI the what we call change rules, like the ability to write ETL, and we gave her clearer opinions and a clearer boundary, we got a really good uplift in the accuracy, right? And in the evals that we got back, we got better responses that made our lives easier. So one of the things when I&#8217;m teaching my canvas, I actually use something that Lawrence talked about, right? Which is modeling based on source, modeling based on report, or modeling based on business process. It&#8217;s one of the things that stuck with me for many years. And so now I&#8217;m thinking, based on that, what we really probably need is an agent that we can go to and say, give us a model based on the source system. And another agent we can say, give us a model based on the core business events or business processes, or the who does what. And a third one where we go, give us a model based on outcome, and then we give it the constraints of what we actually wanna achieve in the next iteration, what do we actually have to deliver, and we get it to model it for us. So based on those three, tell us what the actual optimal model is to achieve this outcome, and then stress test it for me. So that is getting a bunch of inputs, right? Because if I think about it, that&#8217;s what great modelers do. They take a stance, but then they always jump. You&#8217;re technical, you source first, but then you go, right, now what are the core business events?</p><p>Customer orders product, good. I don&#8217;t need to worry about store ships product in this iteration, so I&#8217;m only modeling in that boundary. And then, okay, what do we know we have to deliver? Is it a dashboard, is it a data service? And then you are using that to iterate that initial stance of the model until you get something that is fit for purpose.</p><p>If I think about it, that&#8217;s what we do as humans. So that&#8217;s probably the process we need to encourage the LLMs to do. What do you think?</p><p><strong>Andy</strong>: I like that idea of asking each of those models to come up with their specific version based on those constraints. You do it by source system, you do it by business process, and then getting them to antagonize each other in terms of getting to a realistic result, taking on board each of those aspects.</p><p>And what did make me laugh in my head when you were talking about a great modeler is that in the Rocky films there was Apollo Creed, who said to Rocky that, you fight great, but I&#8217;m a great fighter.
And I think that about the modeling domain as well.</p><p>you can point at a person and say, you model great, but I&#8217;m a great modeler. And that&#8217;s just built up through experience, That&#8217;s just built up through battle testing models that you&#8217;ve created over the years and iterating over those models over the years. And like you said, in a couple of points before.</p><p>About the LLMs generating these models and coming up with dimensional models because that&#8217;s what the prevalent documentation will have. Those models have been trained over that documentation, so those models will be like, ah, okay, this seems to be the most popular way of doing things. So then when those LLMs are generating a model based on a source system, it&#8217;s gonna be based on the context of them understanding that source system and what the output of that source system is.</p><p>I look at something like Dynamics, which has generally been something that is quite difficult to model when so much customization happens within the platform. There is no real vanilla implementation of dynamics, which means that the LLM is gonna have to. Understand business context, not just source system because I guess the source system is just going to have all these entities that it might think I don&#8217;t have any information, because those sorts of things haven&#8217;t been discussed before.</p><p>Before. And as human beings, I guess we can reason over those things and we can hypothesize and we can make a best guess and iterate over perhaps the LLM can&#8217;t really do that because it&#8217;s been trained on previous data, previous examples, and it doesn&#8217;t have the ability to think outta the box if it hasn&#8217;t encountered something before.</p><p>Yeah, interesting point there, Shane.</p><p><strong>Shane</strong>: Although I would posit that it probably has encountered it because it&#8217;s got access to information outside of data warehouse, data modeling. So it&#8217;s gonna have all the books on Dynamics implementations, it&#8217;s gonna have all the blog posts of people who have customized it. But yeah, I get your point, especially things like SAP, where, who knows how that bloody thing works.</p><p>It&#8217;s it&#8217;s gonna have more knowledge than I do on that. But less knowledge than an enterprise. SAP data modeler, who does it for a thing. so again, I think it comes back to being clear about when you want to provide an opinion when you don&#8217;t. So when you wanna provide an opinion, &#8216;cause it&#8217;s important that the LLM or the agent stays within that boundary for you or where you just leave it because it in theory has access to more expertise and knowledge than you have in that, in a specific space. And so if we go back to medallion, if we go back to layer data architectures, that&#8217;s actually a really good place where you might wanna be opinionated. Because if you are saying I want you to help me do a conceptual model, you probably don&#8217;t care about your layered architecture. But if you are saying, I want you to help me create physical models that I&#8217;m gonna implement, I know that. My layered architecture is I have a designed layer that is concepts, details, and events. And my consume layer, is a one big table. And I know that your cleanse layer is source specific data structured with data being cleansed. 
And your gold layer is, I&#8217;m guessing, a dimensional model.</p><p>If each of us had the same agent, but we put in those opinionated boundaries, we are gonna get back physical data models that are more fit for purpose for us to implement in our platforms of choice. And realistically, if the LLM came back to you and said, throw away your dimensional models and do one big table from cleansed, you are probably gonna look at that and go, the cost of change is quite high.</p><p>I really need to understand why you&#8217;re making me do that, because I have to learn a whole lot of new things and I have to rebuild everything. So I think, again, being opinionated where you want to be opinionated, where that makes sense, is important. And then being free and hippie and open to whatever the great AI agent in the world tells us as a starter for ten, and then stress testing it with expertise.</p><p>I think that&#8217;s where we&#8217;re gonna end up. What do you think?</p><p><strong>Andy</strong>: So I think that opinionated aspect is quite important in terms of how you approach the LLM, because being opinionated about something means that you&#8217;ve got strong convictions in how you want to do something, which then also means that someone else could be opinionated and have strong opinions in how they want to do something.</p><p>And as we know over the years, when you&#8217;ve been working with other people that have experience doing these things, there might be a crossover, there might be some differences of opinions. How is the LLM going to know those differences of opinions that you can work out and collaborate on? Yes, it&#8217;s got this body of knowledge.</p><p>Hopefully the people with the opinions have been providing information, writing books, writing blog posts that the LLM can take on board and learn and use. But if I come to a project and use an LLM to create my medallion, and I also ask it to create my data model based on my biases and based on my experience and the way that I wanna do things, I&#8217;m probably gonna guide the LLM to a certain conclusion.</p><p>Someone else is gonna come along and say, I wanna start this project, I wanna lay my data out, do you have some thoughts about how I wanna lay it out? I have used this in the past and then I wanna use data modeling to do it. It might come up with a slightly different outcome based on your own strength of conviction and your biases that you&#8217;ve put into the LLM.</p><p>It might come back with something that is generic at first, but then you might say to the LLM, actually, I don&#8217;t want my historization in silver, I want it in raw, because that&#8217;s traditionally where I&#8217;ve put it. Whereas someone else could say, okay, why have you asked me to put my historization in raw? I&#8217;ve learned that it&#8217;s supposed to be in silver. But if we remove the LLM from this conversation for a second, it&#8217;s almost like we&#8217;d get the same outcome if it was just humans.</p><p>Because humans would go into an organization, implement a data platform, and based on their experience and biases, they would implement it a certain way, and a different set of people could go into that same organization and implement it slightly differently.</p><p>So I don&#8217;t think we&#8217;re necessarily solving the problem of these differences of opinions or the convictions that you have.
It&#8217;s just using another tool to help possibly reason, to help possibly have this entire body of knowledge and experience at our disposal that we can interrogate, and hopefully get to a more human slash AI reasoned outcome.</p><p>That&#8217;s what I use it for. So I would use AI to help me model, based on my experience, but I&#8217;m also not going to be arrogant enough to think I know everything there is to know about data modeling. So I also want to ask it questions about, have we thought about it this way, or have we thought about it the other way?</p><p>Just so that I can bring a little bit of critical thinking to it as well, and then evaluate the outcome. Yeah, it&#8217;s an interesting point there, Shane.</p><p><strong>Shane</strong>: I like where we&#8217;ve got to, &#8216;cause it goes back to where we started, which is, if you are new to the data domain and you wanna understand data modeling, there really isn&#8217;t a lot of great content apart from the Kimball stuff around. And there are other techniques. So now LLMs give us access to ask questions and effectively get educated on what the art of the possible is.</p><p>What could I do? The second thing is, the mentors, the grumpy people who could just look at your model and tell you where you got it wrong, they&#8217;re not really so prevalent in organizations anymore for some reason. So again, using the LLM to provide that expertise, that rigor, that stress testing, that feedback is really valuable. There is another lens, but we don&#8217;t have time to go into detail, but I&#8217;ll just drop it in here &#8216;cause maybe we come back and have another chat about this idea when you&#8217;ve thought about it a bit, which is, centralized platforms may disappear. So if we think about CRMs, they are centralized platforms that everybody uses with a core bunch of features, and they&#8217;ve been built in a way that becomes reusable. And what we&#8217;re seeing now with vibe coding is actually you could build one feature, one app that does one thing really well, really quickly, and it doesn&#8217;t need to be part of a shared platform. And that&#8217;s gonna be a really interesting change in the market when that happens. That also means that you can vibe code something that has 30,000 lines of code, and while you should know what it does, potentially you don&#8217;t need to care, because if it&#8217;s safe and secure, it does the job. Actually, you don&#8217;t need the expertise to understand how it does it. If we apply that to the data domain and this idea of moving away from shared data platforms, then if I&#8217;ve created this information product that does one thing well, and I&#8217;ve got the LLM to design five different architectural data layers using 17 different data modeling techniques and a bunch of code I don&#8217;t understand, as long as it doesn&#8217;t have to be reusable, then maybe we don&#8217;t care. And again, I&#8217;m old, I find reuse really valuable, I find expertise and shared language really important. But maybe the new world is one-and-done information products where the LLMs are doing everything completely differently each time you deliver a product. But we&#8217;re out of time.</p><p>So I&#8217;m gonna leave that one on the table, maybe have a think about it, and that could be a good follow up conversation around how we would use AI and LLMs to remove the need for reuse and shared platforms in the new AI world. But before we finish off, how do people find you?
How do they see what you&#8217;re reading, what you&#8217;re writing?</p><p>You&#8217;re obviously spending a lot of time going to some great conferences what those conferences are and how they can find you.</p><p><strong>Andy</strong>: Yeah, so I think the first thing is Link Tree. So Link Tree is just is basically my go-to in terms of, I. A jumping off point for people to find me. So that&#8217;s basically, link Tree slash Andy Cutler, so A-N-D-Y-C-U-T-L-E-R &#8216;cause that&#8217;ll take you to my company. So data high.com.</p><p>That&#8217;ll take you to my community blog, which is serverless sql.com. There&#8217;s my Blue Sky account, there&#8217;s YouTube, which is data high, and then my LinkedIn as well. So the Link Tree, Andy Cutler will take you to everything You need to do most of the conferences. I am in the Microsoft space and predominantly fabric, which is, which is where I, I spend, as the Fresh Prince of Bel Air would say most of my days.</p><p>The next conference is over in Oslo, so that is fabric February. So that&#8217;s gonna be over in Oslo. There&#8217;s obviously SQL Bits, which is, one of the UK&#8217;s biggest data conferences as well. so I would say, I generally post on LinkedIn blog posts, opinions, lots of conversation as well.</p><p>And as I said, yeah, Tre Andy Cutler, that&#8217;s where you&#8217;ll find me.</p><p><strong>Shane</strong>: Excellent. Alright, hey, thanks for a great chat around AI and LMS and data modeling and I hope everybody has a simply magical day. </p><h2>&#171;oo&#187;</h2><div class="pullquote"><p><em>Stakeholder - &#8220;Thats not what I wanted!&#8221; <br>Data Team - &#8220;But thats what you asked for!&#8221;</em></p></div><p>Struggling to gather data requirements and constantly hearing the conversation above?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Bu2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" width="387" height="342" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:387,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19725,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/160520537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Want to learn how to capture data and information requirements in a repeatable way so stakeholders love them and data teams can build from them, by using the Information Product Canvas.</p><p>Have I got the book for you!</p><p>Start your journey to a new Agile Data Way of Working.</p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://adiwow.com/168&quot;,&quot;text&quot;:&quot;Buy the Agile Data Guide now!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://adiwow.com/168"><span>Buy the Agile Data Guide now!</span></a></p><h2>&#171;oo&#187;</h2>]]></content:encoded></item><item><title><![CDATA[The pattern of Metric Trees with Timo Dechau ]]></title><description><![CDATA[AgileData Podcast #77]]></description><link>https://agiledata.info/p/the-pattern-of-metric-trees-with</link><guid isPermaLink="false">https://agiledata.info/p/the-pattern-of-metric-trees-with</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Fri, 21 Nov 2025 18:09:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/tThvtGdGEig" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Timo Dechau about Metric Trees</p><blockquote><p><strong><a href="https://agiledata.substack.com/i/179577155/listen">Listen</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/179577155/google-notebooklm-mindmap">View MindMap</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/179577155/google-notebooklm-briefing">Read AI Summary</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/179577155/transcript">Read Transcript</a></strong></p></blockquote><p></p><h2>Listen</h2><p>Listen on all good podcast hosts or over at:</p><p><a href="https://podcast.agiledata.io/e/the-pattern-of-metric-trees-with-timo-dechau-episode-77/">https://podcast.agiledata.io/e/the-pattern-of-metric-trees-with-timo-dechau-episode-77/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://podcast.agiledata.io/e/the-pattern-of-metric-trees-with-timo-dechau-episode-77/&quot;,&quot;text&quot;:&quot;Listen to the Agile Data Podcast Episode&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://podcast.agiledata.io/e/the-pattern-of-metric-trees-with-timo-dechau-episode-77/"><span>Listen to the Agile Data Podcast Episode</span></a></p><p></p><blockquote><p><strong>Subscribe:</strong> <a href="https://podcasts.apple.com/nz/podcast/agiledata/id1456820781">Apple Podcast</a> | <a href="https://open.spotify.com/show/4wiQWj055HchKMxmYSKRIj">Spotify</a> | <a href="https://www.google.com/podcasts?feed=aHR0cHM6Ly9wb2RjYXN0LmFnaWxlZGF0YS5pby9mZWVkLnhtbA%3D%3D">Google Podcast </a>| <a href="https://music.amazon.com/podcasts/add0fc3f-ee5c-4227-bd28-35144d1bd9a6">Amazon Audible</a> | <a href="https://tunein.com/podcasts/Technology-Podcasts/AgileBI-p1214546/">TuneIn</a> | <a href="https://iheart.com/podcast/96630976">iHeartRadio</a> | <a href="https://player.fm/series/3347067">PlayerFM</a> | <a href="https://www.listennotes.com/podcasts/agiledata-agiledata-8ADKjli_fGx/">Listen Notes</a> | <a href="https://www.podchaser.com/podcasts/agiledata-822089">Podchaser</a> | <a href="https://www.deezer.com/en/show/5294327">Deezer</a> | <a href="https://podcastaddict.com/podcast/agiledata/4554760">Podcast Addict</a> |</p></blockquote><p></p><div id="youtube2-tThvtGdGEig" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;tThvtGdGEig&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe 
src="https://www.youtube-nocookie.com/embed/tThvtGdGEig?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>You can get in touch with Timo via <a href="https://www.linkedin.com/in/timo-dechau/">LinkedIn</a> or over at <a href="https://timodechau.com">https://timodechau.com</a></p><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><h2>Google NotebookLM Mindmap </h2><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yym7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2faf6195-535a-4133-9c15-e07f6f49f9b4_9690x13137.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yym7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2faf6195-535a-4133-9c15-e07f6f49f9b4_9690x13137.png 424w, https://substackcdn.com/image/fetch/$s_!Yym7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2faf6195-535a-4133-9c15-e07f6f49f9b4_9690x13137.png 848w, https://substackcdn.com/image/fetch/$s_!Yym7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2faf6195-535a-4133-9c15-e07f6f49f9b4_9690x13137.png 1272w, https://substackcdn.com/image/fetch/$s_!Yym7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2faf6195-535a-4133-9c15-e07f6f49f9b4_9690x13137.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yym7!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2faf6195-535a-4133-9c15-e07f6f49f9b4_9690x13137.png" width="1200" height="1626.923076923077" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2faf6195-535a-4133-9c15-e07f6f49f9b4_9690x13137.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1974,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:5073889,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/179577155?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2faf6195-535a-4133-9c15-e07f6f49f9b4_9690x13137.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" 
class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yym7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2faf6195-535a-4133-9c15-e07f6f49f9b4_9690x13137.png 424w, https://substackcdn.com/image/fetch/$s_!Yym7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2faf6195-535a-4133-9c15-e07f6f49f9b4_9690x13137.png 848w, https://substackcdn.com/image/fetch/$s_!Yym7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2faf6195-535a-4133-9c15-e07f6f49f9b4_9690x13137.png 1272w, https://substackcdn.com/image/fetch/$s_!Yym7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2faf6195-535a-4133-9c15-e07f6f49f9b4_9690x13137.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p></p><h2>Google NoteBookLM Briefing</h2><h3>Executive Summary</h3><p>Metric Trees: A Framework for Strategic Alignment and Business Transparency</p><p>Metric trees are a powerful framework for visualizing the mathematical and logical relationships between key business metrics. Their primary function is to deconstruct high-level, often non-actionable, &#8220;output metrics&#8221; like Monthly Recurring Revenue (MRR) into a hierarchy of actionable &#8220;input metrics&#8221; that individual teams can directly influence. This transforms abstract company goals into a tangible map that guides strategic planning and execution.</p><p>The core value of a metric tree lies in its ability to serve as a shared language and a visual representation of the entire business model. It fosters alignment across disparate teams&#8212;such as product, marketing, sales, and data&#8212;by clarifying how their specific activities contribute to top-line objectives. 
By creating this transparent map, organizations can have more focused conversations about priorities, diagnose performance issues, and measure the true impact of strategic initiatives.</p><p>Implementing metric trees effectively requires a blend of product-thinking patterns and data discipline. Methodologies like Event Storming and Domain-Driven Design are crucial for identifying the core processes and high-value &#8220;heartbeat events&#8221; that underpin the business. Ultimately, a metric tree is a strategic tool for planning, communication, and high-level monitoring; it complements, rather than replaces, the need for deep-dive, explorative dashboards.</p><h3><strong>The Challenge: Isolated Metrics and Disconnected Teams</strong></h3><p>Organizations often struggle with a fragmented understanding of performance, driven by metrics that are viewed in isolation and teams that operate on different &#8220;planets.&#8221; This disconnect manifests in several key challenges:</p><p>&#8226; <strong>The Problem of Isolated Metrics:</strong> Stakeholders are frequently presented with long lists of potential metrics (the &#8220;PDF with 100 SaaS metrics&#8221; problem) without any context for how they interrelate. A metric like MRR, viewed alone, offers little guidance on what actions to take. As Timo Dechau states, the critical missing element is the &#8220;relationship&#8221; between metrics, which is essential for making them actionable.</p><p>&#8226; <strong>The &#8220;Iceberg&#8221; of Complexity:</strong> Metrics that appear simple on the surface, such as &#8220;Active User&#8221; or &#8220;MRR,&#8221; conceal immense complexity. Defining these metrics accurately requires a deep understanding of the product&#8217;s specific use cases and the business&#8217;s various edge cases, a process that can take weeks or months. A well-defined metric can &#8220;change the track for a company,&#8221; but a poorly defined one creates confusion.</p><p>&#8226; <strong>Siloed Data Cohorts:</strong> The data domain itself is often fragmented. Cohorts focused on the data warehouse, product analytics, and data science frequently use different techniques and technology stacks, even when working with similar data patterns. Product analytics, with its focus on behavioral event data and sequence analysis (funnels, cohorts), has historically operated separately from classic Business Intelligence (BI), creating what Dechau describes as &#8220;two different planets&#8221; that rarely interact.</p><h3><strong>Defining the Metric Tree</strong></h3><p>A metric tree, also known as a driver tree, is a visual framework that maps the relationships between metrics, showing how lower-level inputs drive higher-level outputs.</p><p>&#8226; <strong>Core Concept:</strong> It functions as a deconstruction of a primary business goal into its constituent parts. Every metric has a relationship to another, and the tree makes these connections explicit.</p><p>&#8226; <strong>Primary Function:</strong> Its purpose is to translate a high-level, non-actionable <strong>output metric</strong> into a series of actionable <strong>input metrics</strong>.</p><p> &#9702; <strong>Output Metric:</strong> A lagging indicator that reflects past success (e.g., Revenue, Profit). 
It is difficult for a team to influence directly.</p><p> &#9702; <strong>Input Metric:</strong> A leading indicator that teams can directly control (e.g., New Accounts, Conversion Rate from Trial).</p><p>&#8226; <strong>Structure:</strong> The tree is a hierarchical model that can often be expressed as a mathematical equation. For example:</p><p> &#9702; MRR is composed of <code>New MRR</code> + <code>Expansion MRR</code> - <code>Contraction MRR</code> - <code>Churned MRR</code>.</p><p> &#9702; <code>New MRR</code> can be broken down further into <code>New Subscribers</code> * <code>Average Plan Price</code>.</p><p> &#9702; <code>New Subscribers</code> can be derived from <code>New Accounts</code> * <code>Account-to-Subscriber Conversion Rate</code>.</p><p>This decomposition continues until the metrics at the lowest level of the branches are things a team can directly execute against, such as running more webinars to increase <code>New Accounts</code>.</p><h3><strong>The Strategic Value of Metric Trees</strong></h3><p>The primary value of a metric tree is not just in the metrics themselves, but in the clarity, alignment, and strategic conversations it enables.</p><h4><strong>A Shared Map for the Business</strong></h4><p>The metric tree acts as a universally understood &#8220;map&#8221; of the business operating model.</p><p>&#8226; <strong>Creates a Common Language:</strong> It allows different departments to point to the same part of the map and understand how their work affects others, breaking down communication silos.</p><p>&#8226; <strong>Fosters Transparency:</strong> It makes the mechanics of the business model clear to everyone. For many employees, it may be the first time they see a clear illustration of how the company generates revenue.</p><p>&#8226; <strong>Reveals Interdependencies:</strong> The map highlights how initiatives in one area can impact metrics elsewhere. Dechau notes that workshops to build these trees often lead to &#8220;eye-opening&#8221; moments where teams realize their actions might be inadvertently hurting another revenue stream.</p><h4><strong>Driving Action and Measuring Impact</strong></h4><p>Metric trees connect daily work to strategic goals, making it easier to plan and measure initiatives.</p><p>&#8226; <strong>Connects Strategy to Execution:</strong> Teams can clearly see their area of influence on the map. A marketing team knows its efforts to generate new accounts are a direct input to the company&#8217;s overall revenue goal.</p><p>&#8226; <strong>Measures Initiative Success:</strong> A specific metric tree, or &#8220;sub-tree,&#8221; can be built for a new initiative. This allows the team to define success upfront and provides a &#8220;control instance&#8221; to validate whether local efforts (e.g., A/B tests) are creating a meaningful impact on the larger business goals.</p><p>&#8226; <strong>Identifies Opportunities:</strong> By populating the tree with data, teams can spot areas of high potential. For instance, a part of the funnel with high volume but low conversion rates becomes an obvious target for optimization.</p><p>&#8220;If a metric is not actionable, it will have a hard life. It lives lonely on this dashboard and no one has an idea what to do with it.&#8221; - Timo Dechau</p><h3><strong>Implementation and Good Practices</strong></h3><p>Building and using a metric tree is a strategic exercise that requires a structured approach and an awareness of its limitations.</p>
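<p>To make the decomposition above concrete, here is a minimal sketch of a metric tree in Python. It is illustrative only: the <code>MetricNode</code> class, the roll-up functions and every number in it are assumptions made for this example, not something from the episode or from any particular metrics tool.</p><pre><code># Minimal metric-tree sketch: leaves hold measured input metrics,
# derived nodes roll their children up into output metrics.
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class MetricNode:
    name: str
    value: Optional[float] = None          # set on leaf (input) metrics
    children: List["MetricNode"] = field(default_factory=list)
    combine: Optional[Callable[[List[float]], float]] = None  # set on derived metrics

    def compute(self) -> float:
        if self.combine is None:
            return float(self.value)       # leaf: return the measured value
        return self.combine([child.compute() for child in self.children])

# Leaf metrics a team can act on directly (made-up numbers).
new_accounts    = MetricNode("New Accounts", value=400)
conversion_rate = MetricNode("Account-to-Subscriber Conversion Rate", value=0.25)
avg_plan_price  = MetricNode("Average Plan Price", value=49.0)
expansion_mrr   = MetricNode("Expansion MRR", value=2000)
contraction_mrr = MetricNode("Contraction MRR", value=500)
churned_mrr     = MetricNode("Churned MRR", value=1200)

# Derived metrics, following the decomposition described above.
new_subscribers = MetricNode("New Subscribers",
                             children=[new_accounts, conversion_rate],
                             combine=lambda xs: xs[0] * xs[1])
new_mrr = MetricNode("New MRR",
                     children=[new_subscribers, avg_plan_price],
                     combine=lambda xs: xs[0] * xs[1])
mrr_movement = MetricNode("MRR movement",
                          children=[new_mrr, expansion_mrr, contraction_mrr, churned_mrr],
                          combine=lambda xs: xs[0] + xs[1] - xs[2] - xs[3])

print(mrr_movement.compute())  # 400 * 0.25 * 49 + 2000 - 500 - 1200 = 5200.0
</code></pre><p>In practice the relationships would live in whatever semantic or metrics layer a team already uses; the sketch is only meant to show how actionable input metrics at the leaves roll up into a top-level output metric.</p>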
<h4><strong>Starting the Journey</strong></h4><p>1. <strong>Map the Process:</strong> The first step is to gain a deep understanding of the customer journey. This is best achieved through a collaborative workshop, like an <strong>Event Storming session</strong>, involving people from all relevant disciplines who can map the process from initial awareness to long-term retention. This process naturally identifies the key milestones and potential metrics.</p><p>2. <strong>Align with Strategy:</strong> It is critical to sit down with the leadership team to understand their strategic priorities for the next 6-12 months. This alignment ensures that the initial metric tree focuses on what is most important to the business, which dramatically increases buy-in and adoption.</p><h4><strong>The Art of Event Tracking</strong></h4><p>A robust metric tree is built on well-defined data. This requires moving beyond simplistic interaction tracking to a more meaningful model of product usage.</p><p><strong>Tracking Interactions:</strong> Capturing every click, scroll, and granular user action. This is the &#8220;take everything and look at it later&#8221; approach. The result is a high volume of noisy, low-signal data that is difficult for human analysis and disconnected from business success.</p><p><strong>Tracking Product Usage:</strong> Applying <strong>Domain-Driven Design</strong> to identify core business entities (e.g., Account, Subscription, Project) and their lifecycles (e.g., Created, Updated, Deleted). The result is a small set (~15-20) of high-value, meaningful events that directly reflect product usage and can be used to build core metrics.</p><p>A key goal is to identify the <strong>&#8220;Heartbeat Event&#8221;</strong>&#8212;the single, central event that proves the product is alive and delivering value. For Slack, this is sending a message; for Miro, it&#8217;s adding an asset to a board. This core event can often be used to define multiple key metrics.</p><h3><strong>Practical Considerations and Limitations</strong></h3><p>&#8226; <strong>Keep it Simple:</strong> Avoid the temptation to &#8220;boil the ocean&#8221; by mapping every conceivable metric. An overly complex tree with 90+ nodes becomes an un-operational &#8220;monster.&#8221; It is better to start with a simple model and use sub-trees for specific initiatives.</p><p>&#8226; <strong>Acknowledge Timelessness:</strong> A standard metric tree exists in a timeless space. It does not inherently account for the time lag between an action and its result (e.g., an increase in new accounts may not impact revenue for 60-90 days). This requires separate <strong>cohort analysis</strong>, which does not fit the tree structure.</p><p>&#8226; <strong>It&#8217;s Not a Dashboard:</strong> A metric tree is a tool for planning, communication, and high-level monitoring. It is not designed for deep-dive, explorative analysis to find the root cause of a problem. Dashboards are still required for that function.</p><p>&#8226; <strong>Who Does the Work?</strong> The skillset required&#8212;combining product thinking, data expertise, and strategic facilitation&#8212;is often found in senior data leaders.
<strong>Heads of Data</strong> are well-positioned to lead this work, as it provides them a strategic lever to connect their team&#8217;s efforts directly to business value.</p><h3><strong>The Role of North Star Metrics</strong></h3><p>The concept of a North Star Metric is closely related to, but distinct from, the top-line metrics on a business-focused metric tree.</p><p>&#8226; <strong>Correct Definition:</strong> A North Star Metric is fundamentally tied to <strong>successful customer experience</strong>. It measures the value customers receive from the product. Revenue is the <em>result</em> of delivering this value, not the value itself.</p><p>&#8226; <strong>Common Pitfall:</strong> A frequent mistake is labeling a revenue goal like &#8220;New MRR&#8221; as a North Star Metric. This confuses an internal business outcome with customer success. Shane Gibson jokes these should be called &#8220;South Star&#8221; or &#8220;East Star&#8221; metrics instead.</p><p>&#8226; <strong>Relationship to Metric Trees:</strong> The North Star Metric is an ideal candidate for the top of a <em>product-focused</em> metric tree, with input metrics below it defining the user behaviors that lead to a successful customer experience.</p><p>On North Star Metrics: &#8220;The concept of a North Star metric is it&#8217;s always bound to successful customer experience, it has nothing to do with revenue. Obviously we hope when people have a good customer experience that... it has a causal connection to more revenue.&#8221; - Timo Dechau</p><h3><strong>Future Outlook: Metric Trees in an AI-Driven World</strong></h3><p>As technology evolves, the principles behind metric trees become even more critical for managing complexity and providing essential business context.</p><p>&#8226; <strong>Context for LLMs:</strong> A defined metric tree provides a structured map of the business model. This is invaluable context for AI agents and LLMs, enabling them to perform more accurate &#8220;what-if&#8221; simulations, brainstorm strategies, and answer complex business questions.</p><p>&#8226; <strong>Governing Democratized Development:</strong> The rise of AI will enable the rapid creation of thousands of small, single-purpose applications (&#8221;AI slop apps&#8221;). This will create immense data complexity. A framework of core concepts (e.g., a universal definition of &#8220;user&#8221;), governed events, and a central metric tree will be essential to ensure these new applications contribute to meaningful business goals rather than creating data chaos. The metric tree provides the &#8220;principles, policies, and patterns&#8221; needed for this new landscape.</p><p></p><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><p></p><h2>Transcript</h2><p><strong>Shane: </strong>Welcome to the Agile Data Podcast. I&#8217;m Shane Gibson.</p><p><strong>Timo</strong>: , And I&#8217;m Timo Dechau.</p><p><strong>Shane</strong>: Hey, Timo. Thank you for coming on the show today. 
I would like to talk about metric trees, something that&#8217;s intrigued me for a little while, so it&#8217;s great to have somebody who knows a lot about them on the show. But before we rip into that, why don&#8217;t you give the audience a bit of background about yourself.</p><p><strong>Timo</strong>: I didn&#8217;t start out in data. I don&#8217;t know. I, it is interesting question would be if anyone actually ever started out in data. I started out in product. I will give some connections to metric trees later. Why? It basically was really important for me to connect it with product. . I started out on product was very quickly annoyed by people doing features just by gut feeling. I wanted to have a different kind of layer, so therefore I started to introduce analytics data to it. So this was really the early days. It was not even called product analytics at that point of time. The term came later. , So I spent a lot of time while walking through different kind of product roles to always setting up data setups for that. Also did the first connections there with data warehouses. Just a slightly little bit. And then after eight years in product, , I decided to go all in data.</p><p>So it was more focusing on the analytics side of things. So I&#8217;d say classic marketing analytics and product analytics. At that point of time, it was the high time in Berlin where a lot of e-commerce startups were really growing a lot. And so I was doing first data warehouse setups mostly in the marketing analytics space. Like people wanted to have their own attribution models and these things. I did this I spent then a lot of time in this kind of area. And then at some point I thought, okay, I want to go back to the roots and want to do more product analytics again. But I wanted to do product analytics in the data warehouse, , which required some things that were not so common at the time.</p><p>So one is like an event data model, which some people had some experience with, but not. Potentially for product use cases. , And then the second thing is there was no approach, like Amplitude postdoc were the classic tools that you do product analytics, they then were possible to put them on top of a warehouse. But then I figured out actually there, most of the time they&#8217;re too complicated for just doing product work. So in the end, you have to have a metric first product analytics approach. And this is where I came back to metrics where I never really had a good relationship with them.</p><p>So we might talk about this in this session. But I had to basically revisit my relationship to metrics. And this is where I came across metric trees again. I had them already at university. And yeah, starting to dive deeper into this.</p><p><strong>Shane</strong>: , I&#8217;ve followed your stuff for a long time, so thank you for sharing so much great content. And what&#8217;s always intrigued me is in the data domain, I could almost see a data warehouse cohort, a product analytics cohort, and a data science cohort. And , although each of those cohorts are using very similar patterns, very similar techniques.</p><p>They always seem to use different technology stacks. And then we had the whole, what was it? Composable, combustible, whatever it is. Yeah. Yeah. So that idea that you don&#8217;t need a dedicated product analytics tool. 
You can just do it off your cloud data warehouse or cloud data warehouse database.</p><p>So it&#8217;s really interesting that there was almost a separate track, wasn&#8217;t there, for product analytics and the techniques you used and the technologies, compared to some of the other cohorts or tracks in that data domain.</p><p><strong>Timo</strong>: It is super weird, because sometimes you talk to people in the data space, let&#8217;s say who do classic BI, and this was my problem when I was starting out. Because I was looking for a specific way I could model product data, the same as marketing data. Marketing data is a bit less, but it&#8217;s a lot of behavioral data.</p><p>So a lot of event data. And you do sequence analysis, so funnels or cohorts and something like that. And it was really a hard time to talk to people who are in the classic BI space and tell them most of the data models don&#8217;t really work for me because I have to do sequence analysis.</p><p>So I cannot create 200 fact tables. It would be a bit crazy. And I really had a hard time to explain, but it was also interesting because I learned a lot from them. I think some learned a little bit from me, but it&#8217;s interesting because everything is called data, but I always say they live on two different planets and they rarely visit each other.</p><p>I think it&#8217;s getting better. So I think there&#8217;s more convergence now, and it&#8217;s something which I, for example, work a lot on. So my last, let&#8217;s say my last mission, maybe it&#8217;s, if it&#8217;s my last, I don&#8217;t know, but my biggest one is really crossing the whole marketing and product behavioral data with revenue data, which then makes it really interesting if you do this.</p><p><strong>Shane</strong>: We had a customer, and they were a scale-up.</p><p>And so we needed to do some work around standard metrics for a startup. The pirate metrics, AARRR, all those kinds of good things. And when you first go into those areas, you go, how hard can it be? And then you find out.</p><p>And it&#8217;s, damn, this should be a solved problem by now.</p><p>The core metrics for startups should be well-defined. They all use standard software-as-a-service products for capturing subscriptions and signups and all those kinds of things. There&#8217;s three or four of them that everybody tends to use. It shouldn&#8217;t be hard; there should be patterns out there that you could just implement and make it really easy. And again, whenever you say those words, you know it&#8217;s gonna bite you in the bum, and then you start doing it and you&#8217;re like, actually, for some reason there&#8217;s a whole lot of complexity just hidden under the top of that layer that bites you.</p><p><strong>Timo</strong>: I just do MRR calculations. So, let&#8217;s say, technically these are all the same metrics. So there&#8217;s, let&#8217;s say, a whole metric set around MRR. But the tricky thing is, how do you define it under the hood? Because everyone comes to you, and in every case the company tells me, no, we have a very standard subscription model.</p><p>There&#8217;s nothing crazy in it. And then I always know, okay, yeah, this will hold for two weeks until we do the first investigation into all the edge cases, and then we will just cover all the edge cases. Ah yeah, there we had to do something different. Oh yeah.
And then yeah, we had this switch from one subscription to the, and so it&#8217;s always messy.</p><p>And that&#8217;s the same with product stuff, I think the pirate metrics are still a really good framework to understand how product, and let&#8217;s say this business around is evolving. But let&#8217;s say just go for something like retention or active user or activated. So activated by default is in the end one metric, but it can take you weeks and sometimes months to redefine how you basically calculate this because it really comes down how the product works, what kind of use cases are you actually solving.</p><p>So it&#8217;s a fascinating, interesting thing, but it&#8217;s also good &#8216;cause in the end it&#8217;s one metric that you just see, but it&#8217;s an iceberg. So on the top you just see this one metric, but quite complex underneath. But this is where the fun is. If you really do a really good job there and the definition and so on, it can really change the track for our company.</p><p><strong>Shane</strong>: Even then though, there&#8217;s still outside of the metrics, there&#8217;s some core patterns you&#8217;ve gotta decide from an engineering point of view. So again, when I look at product analytics, when I look at event tracking in a software product, typically you have a choice of pull every event. so effectively pull event logs to say lots of things happened, and then apply the events you care about after the fact.</p><p>So more of a Lakey type pattern. Or only define the events you want in the product when at the beginning. And then only bring those in and use those. And if you want to look at a new event, you&#8217;ve gotta go and define it again. And there&#8217;s a trade off there, right between this,</p><p>If you get every event, you get this wash of noise and it&#8217;s really hard to figure out where the signal is.</p><p>But when I worked with a lot of product teams, they seemed really reticent to define the events that were important. It was easy for them to say, I&#8217;ll just take everything and we&#8217;ll look at it later, versus this is the event that will tell us that this feature was successful.</p><p>That&#8217;s actually quite a hard metric to define, </p><p><strong>Timo</strong>: it is, this is this is another this is my second passion topic and this is the one where I wrote a book about. I think the problem is when you go into this, you have to distinguish between tracking interactions and tracking actually product usage. And so most of the setups are tracking interactions.</p><p>Where people click, and it&#8217;s quite understandable why you do this because you have to define something. So it&#8217;s not it&#8217;s not everyone&#8217;s job to define what you want to track. So it&#8217;s usually a side job that you have to do in between. So the, let&#8217;s say the most approachable way to do this is you open up your own application, then you see, okay, where can people actually do something?</p><p>And so they&#8217;re like, okay, they can click here, they can click here and they can do this. And this is what I mean you track interactions. If, let&#8217;s say, if this whole output is just for machines, then it might be. Okay. We are still not there. 
That is, let&#8217;s say you can collect all this very noisy, very granular data, and then you have something running over it, finding some patterns.</p><p>So far I didn&#8217;t really see things that go in this direction, but it might come at some point. But when a human has to analyze it, this doesn&#8217;t work. It&#8217;s far too far away from what actually defines product success or what actually defines business success. So it has too much noise, as you said, and so therefore what I ask product teams is: define the use cases and define the jobs to be done.</p><p>Define the entities that make your product. So in the end, it&#8217;s classic domain-driven design that I do with them. And so, okay, let&#8217;s define the different kinds of entities that build up your product. It&#8217;s usually five or six. And then we define what the lifecycle looks like for this. So let&#8217;s say we have an account: it can be created, can be updated, can be deleted. So three events. Updated you usually don&#8217;t need, because there&#8217;s not often a business value in it if someone updates an account. Who would optimize on account updating, usually? Yeah, there are edge cases. But you can do this for everything else. You can do it for subscriptions. You can do it for whatever your product is built of. And then, this is usually my take, you end up with 15 events, and the same way you end up with 10 metrics. It&#8217;s sometimes really quite crazy if you have a good setup. So right now I was just working on a project where we had one event that was defining six core metrics that could explain if the business is running or not. This is always my interesting take when I work with a client: in every setup you have something which I call a heartbeat event. And this is a central event. If you just need to track one event, this is the one that tells you, is your product still alive or not.</p><p>So for Slack it&#8217;s messages; for Miro it&#8217;s, let&#8217;s say, an asset added to a board. You always have this one event where you know, okay, when this is still coming in, things are good; we don&#8217;t know everything about it, but they&#8217;re still there, looking good. And you have to get to this level, then you can basically tame the chaos that product analytics can cause.</p><p><strong>Shane</strong>: It sounds a bit like a North Star metric. This idea that yes, all the other things are important, but actually if you can define the one thing that really is the core, focus on that and the other things are useful, but if you don&#8217;t focus on that one thing, </p><p><strong>Timo</strong>: I always have to fight a little bit for the North Star metric, because there are so many definitions out there, and I think the worst case is when someone comes to you and says, yeah, new MRR is our North Star metric, where you say, no, it&#8217;s actually not. Let&#8217;s say the concept of a North Star metric is it&#8217;s always bound to successful customer experience.</p><p>It has nothing to do with revenue. Obviously we hope when people have a good customer experience that, let&#8217;s say, it has a causal connection to more revenue. And then the North Star metric is usually also built up by two or three input metrics that lead to it.
But in the end, it&#8217;s the same as what you said.</p><p>You have a set of three metrics that could explain how your product is actually delivering value at the moment.</p><p><strong>Shane</strong>: I&#8217;m with you in that the North Star metric is success for the customer, not success for the business, because if the customers get the success that you promised them, then your business will grow. But yeah, time and time again I see North Star metrics being internal metrics. And it&#8217;s, maybe we should just go read the definition again, or call it something else, South Star metric or East Star. Like, just give it a different name. That&#8217;s okay.</p><p><strong>Timo</strong>: Yeah. You can always call it like, this is our, yeah, I don&#8217;t know. I think North Star metric is just a great name, so it&#8217;s a great label. So, you know, everyone understands what it is. It sounds great, like adventure. So this is our North Star. We have to go there.</p><p><strong>Shane</strong>: That&#8217;s my standing joke, is that as data people we laugh about the fact that our stakeholders can&#8217;t define active customer, active subscription. Yet in the data domain, we argue about the definition of a semantic layer, a data product, a north star metric, day in, day out.</p><p>So we&#8217;re as bad, if not worse, than them. So on that note, let&#8217;s go and talk about metric trees. So if somebody said to you, what the hell is a metric tree, how would you describe it?</p><p><strong>Timo</strong>: First of all, the same. I think it might be also a definition problem. So some people call it a driver tree. There&#8217;s also the term KPI tree. I&#8217;m not really good with definitions. I know that they have slightly different variations of it. So I give you my explanation and my definition of it. So one problem that you often have with metrics is when you take a metric alone. So let&#8217;s say you have a software as a service. So you&#8217;re not really sure what you should measure. You are on LinkedIn, one of those posts where someone is posting, hey, I just compiled the greatest collection of software as a service metrics.</p><p>Like common metrics, and you get my PDF with these 100 metrics. These things never really work for someone. And the problem is why they don&#8217;t work. It&#8217;s because everyone picks a metric and looks at it in isolation. So, I don&#8217;t know, yeah, for example MRR, monthly recurring revenue. So you have it standing alone there, and then you don&#8217;t really have an idea what you should do with it.</p><p>Because two things are missing in what defines the metric. Mostly, it&#8217;s missing a relationship. So let&#8217;s say every metric has a relationship to another metric. I cannot even think of one which is really standing alone. So usually you have a relationship, and why do you need this relationship? Because not every metric is very actionable. So we talked about this North Star, when you say, hey, revenue is our North Star. So the problem with MRR, monthly recurring revenue, is it&#8217;s an output metric, so it&#8217;s something that comes out at the end. So when the CEO goes into a meeting and, let&#8217;s say, assembles all the heads of the different teams and says we have to increase MRR, everyone would say, yeah, sure.</p><p>Sure, it&#8217;s our business model, so it makes sense, but no one would immediately come up with an idea, oh yeah, sure, we should do that to increase MRR. No, you usually break it down.
So you break it down into, oh, okay, when we need MRR, maybe we need more accounts, because they could end up in a subscription and then this would be new MRR. But then this is interesting, because this is already the first step to build out a metric tree, because then you have MRR, and then you say, okay, what is actually making up MRR? New subscribers multiplied by an average plan price, or average price, and then new subscribers you can build up from new accounts with a conversion rate. And then you get into your first version of a metric tree. And I think the nice thing about a metric tree is that it explains how you can get from something that you will end up with to something that you can directly influence. So if you go to a marketing team and tell them, hey, we need more new accounts, they usually have an idea what to do. So they will be like, oh yeah, okay, maybe we have to run more webinars or we have to invest in a podcast or whatever. And so there you make it actionable. And at least this is my driving force for why metric trees are interesting: because they help you to break down something that you want to achieve into something that you can actually do. And this is the tricky thing. So if a metric is not actionable, it will have a hard life. It lives lonely on this dashboard and no one has an idea what to do with it.</p><p><strong>Shane</strong>: So if I think about it, what you&#8217;re really saying is there&#8217;s a bunch of metrics and they have a relationship to each other.</p><p>And we treat it like a tree. So if we think about an HR, human resources, org chart, where we typically see a tree of people and people report up this tree, what we are saying is, if we can find the relationships, metrics that behave that way, so these metrics support this other metric which supports this other metric, then that relationship helps us work out what we need to change in our business to move one or many of those metrics.</p><p>Is that kind of the concept?</p><p><strong>Timo</strong>: Yeah. That&#8217;s the concept. You can also use it for that. Where I like to use it is to really try to explain how the business works on a, let&#8217;s say, common concept that at least the data team understands and that the business team understands. And this is something which I would say the metric tree has so far worked best for.</p><p>So I tried different kinds of formats to bring both teams together. We did it as, okay, what kind of events do we have to measure? Usually it doesn&#8217;t really work well because it&#8217;s too abstract, and you sometimes find common ground, but it&#8217;s too far away from what people really care about. If you really build up a metric tree, you have really interesting conversations, let&#8217;s say when I do these workshops, where it&#8217;s usually with startups, where it&#8217;s easier to get relevant people into one room.</p><p>So then we have people from all the different kinds of disciplines. And we have really good conversations, because for a lot of people it&#8217;s the first time that they see how the revenue actually comes together in the company. So this is, let&#8217;s say, often, unfortunately for data people, the first time that they see, oh, this is how we calculate revenue. Because it&#8217;s usually one person who does this part, but not everyone. And then for product it&#8217;s the same, they&#8217;re like, oh, okay, I was not aware that we also do this to get revenue.
So it&#8217;s a really interesting exercise that can open up a lot of black boxes and help understand how the whole system actually builds up. And then, obviously, if you go a little bit further down, all the different teams might find a place in this tree where they say, oh, this is actually where I work. For the data team, they usually don&#8217;t find a place there. But they&#8217;re the ones who can build the metric tree and can provide the metrics. But let&#8217;s say you can have a part where you can see what product can do, you will have a part where you can see what sales do. Then you will have a part where you can see, oh, this is actually the influence area of marketing. And then in theory you can analyze how these different kinds of levers that you have, let&#8217;s say on a lower level, are actually contributing to the whole final output metrics. Every time I did this kind of format in a workshop, there was always some kind of eye-opening moment in there where someone said, oh, I never looked into this. Or someone even said, actually, right now I think with some initiatives we are starting to hurt our revenue, and so on.</p><p>No one had recognized that, because now we see it for the first time: when we do this, let&#8217;s say this stream of revenue will have a problem, which no one really thought about. And so this is something where metric trees can definitely help to bring transparency into, let&#8217;s say, the mechanics of how the business is actually making money, which often is completely overlooked, because a dashboard in theory does it, but it doesn&#8217;t have the structure.</p><p>The tree structure is nice because someone can look at it and immediately understands what it does. Yeah. So go from the bottom up to the top, or the other way down. </p><p><strong>Shane</strong>: I think that&#8217;s part of the core value of it, is that simple map. &#8216;Cause what we know is there&#8217;s a lot of complexity, and when we draw a simple map, a simple diagram, two things happen. People can visually ingest it and understand it. There&#8217;s something about a human looking at a map that just naturally sees the pattern.</p><p>And the second part is you can point to part of the map and everybody knows they&#8217;re talking about the same thing. So that idea of different parts of the organization, different silos, understanding how they affect another metric. So, you know, your example, how do we increase monthly recurring revenue? Well, there&#8217;s a bunch of metrics that the marketing team can affect: number of ads placed, number of prospects found, optimizing the funnel from prospect to signup.</p><p>And they can look at that part of their map and go, I can do some work in there. And that in theory will increase our monthly recurring revenue, because that&#8217;s up the tree. So that idea of a map gives us understanding and also gives us the clarity of a shared language: I&#8217;m focusing on this part of the tree.</p><p><strong>Timo</strong>: Yeah, I&#8217;m very happy that you came up with the word map, because this is actually what it is. It&#8217;s one map that can be very useful. So at least when I create one for the projects, it&#8217;s not that we constantly work with it. So we might get to this point, like how operational are metric trees, but where I always use it is when we come together and discuss big things.</p><p>Okay, what kind of metrics should we focus on? Let&#8217;s say, what kind of initiatives should we look into?
So it can always help to have it then sitting around, and when we say, okay, look, now we want to focus on this kind of area, you can always look then in, let&#8217;s say, the tree.</p><p>And the nice thing is, when you have the tree with some data, then you can see, okay, we focus on this kind of area because we have the feeling we have a lot of volume here. But we can see, for example, the conversion rates in these kinds of areas are not really high. So there&#8217;s, let&#8217;s say, quite nice potential for us to improve things just slightly, a little bit, but see a big impact, because we see that a lot of push comes in there. And so then we can analyze how this works. For that it&#8217;s really nice to have it, because again, everyone immediately knows where we are. So it&#8217;s not that you have to explain for too long how this might have an impact on revenue, because immediately everyone sees that.</p><p><strong>Shane</strong>: The good thing for data teams is it means they don&#8217;t have to go look at yet another stupid strategy PowerPoint with four boxes that have no context and no understanding and no data.</p><p>You might have to, but at least they can say, that part of the pyramid, or, what are we doing this week, the circle, whatever the latest consulting picture is for selling a story.</p><p>How does that map to the tree? One of the things you said, though, was data teams would struggle to figure out which metrics are theirs, and that kind of gave me an idea. So, you know, one of the things I work on a lot is this idea of an information product, and one of the key things I teach teams is focus on the action and the outcome.</p><p>If you don&#8217;t understand what action the stakeholder&#8217;s gonna take and what outcome&#8217;s gonna be delivered from that action and the value of that outcome, then really there&#8217;s a risk you&#8217;re doing data work that has no value. While we say stakeholders are accountable for that, actually as data teams we should be as well; we should be holding our stakeholders to account that they can describe the value, or at least the outcome, of the action that&#8217;s been taken.</p><p>So what would be interesting is, if there&#8217;s a metric tree in place, a data team should be able to point to the metrics that they&#8217;re doing data work for, or with, to improve: I&#8217;m working with the marketing team to use data to reduce the time between a prospect being identified and signing up for an account.</p><p>And so they should actually be able to use the metric tree to show where the data work is adding value to the organization, in conjunction with those stakeholders, those different business operating groups.</p><p><strong>Timo</strong>: Yeah, exactly. I think this is where it can really play a really nice role, because you can sit in these, let&#8217;s say you work with the marketing team. The marketing team does some planning for the next two or three months. So they have some ideas; let&#8217;s say they come up with three initiatives they want to do. And so you can take every initiative and you can then say, okay, here on the big company metric tree, these initiatives are trying to improve these kinds of areas. And then we can use this metric tree, as always, as our control instance to really see, okay, do we see some impact? Because that&#8217;s a tricky thing.
You can run a lot of initiatives and locally looks really great, but when you look at it on the big picture, everyone is yeah, I don&#8217;t know. AB tests look great, but I don&#8217;t see any kind of uptick somewhere. So it&#8217;s definitely nice to have, okay, this is the one big metric that it should, let&#8217;s say influence and let&#8217;s see if we can fire it up enough that we can see some influence. Then what you can do. The second thing is let&#8217;s say you can create different views of metric trees. So you can do this very high level for a company, but then you can also take every initiative and just build a metric tree for every initiative.</p><p>This is something which I often like to do because again, it helps you to understand, okay, how do we measure the success of this initiative? So you might have this one big metric that we identify in the company metric tree that let&#8217;s say is at the top end. And then we break it down for this kind of initiative.</p><p>And so then it really comes, let&#8217;s say you want to improve conversion rate. So then we can build the whole surrounding around this conversion rate and maybe even break it down by three different channels because this is what we are trying to achieve. You&#8217;re trying to get more people from this kind of channel.</p><p>So then you basically bake strategy into the metric tree and the strategy connects. Directly to the initiative that marketing is driving. And then everyone has a very clear idea what you do. And then also you will build something that in the end, can marketing and you can use to explain if this initiative was successful or not. , And you can validate it before you can ask people, okay, look, when we deliver this, does this make sense to you? When, let&#8217;s say when the initiative is over in four weeks and we report on this metrics and we define, okay, success looks like when we move this part, I don&#8217;t know, by 10%. So does everyone agree with that? And then because I have the feeling it makes it for a lot people easier to think in that way. So that they not don&#8217;t say, yeah, I don&#8217;t have really idea if I should agree or not to this because at least, yeah, I think metric, at least they understand. They might ask, okay, how do we define this kind of metric? But that&#8217;s fine. And so then we can spend some time to say, okay, how we define it. But I like this approach a lot. Also like to not always see the metric tree as this one big, okay. We explain the whole business model, but really to use it to explain how I run a specific initiative. It&#8217;s also nice for me. Let&#8217;s say personally when I work on these, let&#8217;s say, as a supporting part for data for these initiatives, it&#8217;s a great brainstorming tool for me. So let&#8217;s say someone comes up with this initiative, I do a first version of a metric tree. Then I say, does it really represent what they&#8217;re actually doing there?</p><p>And then I say, yeah, to a very generic part, but maybe not customized enough. So I&#8217;m always trying to tweak the metric tree that the people who run the initiative immediately find it in there. So let&#8217;s say I do something for e-commerce. And the e-commerce is really pushing and getting high loyal customers. Let&#8217;s say , they really want to improve the segment of high loyal customers who buy all the time. There&#8217;s these, let&#8217;s say, high buyers that you sometimes have. 
So when I just report on let&#8217;s say new customers or returning customers. Then I don&#8217;t really have it covered because yes, we are going for returning customers, but for a special segment within returning customers.</p><p>So therefore what I can do is I can say, okay, I break it down in three groups and new returning and let&#8217;s say VIP customers and then with the VIP customers for the first time, I make it visible what kind of impact the initiative right now has. Or let&#8217;s say I can see, okay, how is the share of these users is growing or not growing or whatever. And so this gives me a lot of, playroom to build something that can really support what the business is trying to do, but do it in a language or in a way that at least most of the people understand.</p><p><strong>Shane</strong>: All righty. So I&#8217;ve got a shit ton of notes right now. So let&#8217;s let&#8217;s go through them one by one, because there&#8217;s so much gold in there. So the first thing you pulled out is this idea that actually metric trees is a shared language. Yeah. So you can, you&#8217;ll define a metric tree for what you think you heard from the organization of how their business operates.</p><p>And by putting that map, by putting that tree in front of them, they&#8217;re gonna identify where you&#8217;ve got it wrong. They&#8217;re gonna look at it and go yeah. Oh no, hold on. We don&#8217;t do that. Or we are different or, I don&#8217;t understand. It becomes that visual shared language of an entire business and their operating model </p><p><strong>Timo</strong>: it maps the process to some degree, and that makes it easy for people to see, okay, is it actually what we do? </p><p><strong>Shane</strong>: Then the second thing is you said when you want to go and do something new, you wanna do an initiative or some investment or change a process or go into a new market, you can pick the metrics you think you are gonna impact and you can guess how much you&#8217;re gonna impact them by. I remember many years ago, it was probably late nineties, early as two thousands, I got into the whole balance scorecard thing.</p><p>I was working for a vendor balance, scorecards were hot. A couple of &#8216;em were trying to build software. And one of the things was, you had this metric and effectively you could put a budget on it. We are gonna increase that by 10% over the next quarter. Now the software itself was pretty hokey.</p><p>But it was the conversations around what are we doing and how&#8217;s it gonna influence that metric. So that&#8217;s one of the key things you called out , is by saying that if we do these things, let&#8217;s guess which metrics are going to change, is that what you&#8217;re saying?</p><p><strong>Timo</strong>: Yes, exactly. And then also really make sure that you measure this metric in the right kind of way. So that, for example, let&#8217;s say you have a corridor that you can see, okay, is this a normal movement of the metric or not? Or that the initiative is big enough that it can move the metric. This is a tricky thing. But often. At least when I work in product or marketing is that often you see a lot of initiatives that are just, let&#8217;s say they&#8217;re very tiny bits. &#8216;Cause I don&#8217;t know, they&#8217;re not really bold. And then it, you will not see anything. And you can even tell this before, okay we try to optimize 5% of the typical audiences that we get in there. So we are working on this. 
There might still be a strategic reason for that, but then you can make it clear: okay, we work on this, but because the sample, or the audience, is so small, we will not see an impact when we look at the completely blended global metric. So you have to break the metric down in a different way so that you can still see something. This check-in really helps to see, okay, how do we actually want to tell whether it makes an impact or not? Because sometimes, just because of the setup, you won&#8217;t see anything, and you can know that up front.</p><p><strong>Shane</strong>: And that was one of the problems with the balanced scorecard. We had this idea of cause and effect. I think you call them input and output metrics, but it&#8217;s this idea that if I improve this metric on the bottom or the left, then it has a cause-and-effect relationship with a metric above it or on the right.</p><p>Back then we really wanted to find correlation. We really wanted to say, actually, if we just throw all the data at the machine, the machine should be able to find causality or correlation between the metrics, we should be able to codify it, and it should be accurate. And we never got there.</p><p>We still don&#8217;t have the technology or the patterns to do that at the moment.</p><p><strong>Timo</strong>: No, I don&#8217;t think so. This is definitely not my area of expertise, and I know some people do try things like that, because when you build a metric tree you have some relationships that are not deterministic. A classic metric tree you can write as an equation, which is also quite nice. But sometimes, for example, you have a metric called active users, with a very well-defined active user definition: people don&#8217;t just show up, they also do valuable stuff within your product, and once they do it within the last 30 days we flag them as active users. There&#8217;s definitely a correlation between being an active user and at some point starting a subscription. Let&#8217;s say you have a free plan, so you have an active user on a free plan; there&#8217;s a correlation between the two, because you might assume someone has to be active to end up in a subscription. But it&#8217;s not a direct connection, so you cannot really say, okay, whenever we get an active user there is a probability of, I don&#8217;t know, 57% that they will end up in a subscription. I guess that one is more straightforward to calculate, but even there I never really came across super good models that can predict this very well.</p><p>If someone knows of one, please let me know. Ping me on LinkedIn. I&#8217;m always interested.</p><p><strong>Shane</strong>: Now the answer&#8217;s just AI slop, right? You can just AI it.</p><p><strong>Timo</strong>: Yeah.</p><p><strong>Shane</strong>: And maybe you can, maybe it&#8217;s good. That non-deterministic pattern matching will be really good at this. I dunno, I haven&#8217;t tried lately.</p><p><strong>Timo</strong>: Coming back to what you said before: we create a lot of noise, and we might throw it all in and find something, yes.</p>
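<p><em>A rough sketch of the &#8220;metric tree as an equation&#8221; idea Timo describes: the deterministic branches multiply out exactly, while the active-user-to-subscription edge is only a correlation, not an identity. The metric names and numbers below are hypothetical.</em></p><pre><code># A deterministic metric tree written as an equation:
#   new_mrr = new_subscriptions * arpa
#   new_subscriptions = signups * signup_to_paid_rate
# All numbers below are made up for illustration.

metrics = {
    "signups": 2_000,
    "signup_to_paid_rate": 0.05,   # 5% of signups start a paid plan
    "arpa": 49.0,                  # average revenue per account
}

new_subscriptions = metrics["signups"] * metrics["signup_to_paid_rate"]
new_mrr = new_subscriptions * metrics["arpa"]

# Because the tree is an equation, the parent is fully explained by its children.
assert abs(new_mrr - 2_000 * 0.05 * 49.0) < 1e-9
print(f"new_subscriptions={new_subscriptions:.0f}, new_mrr={new_mrr:.2f}")

# A non-deterministic edge, by contrast, is only an observed relationship:
# "active users" correlates with later subscriptions, but there is no fixed
# conversion factor to plug into the equation, so it stays a dotted line.
observed_active_to_paid_rate = 0.57  # hypothetical, would come from analysis
</code></pre>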
<p>But I don&#8217;t think correlation analysis alone gets you there.</p><p><strong>Shane</strong>: But the key is the visualization of a map, the conversations about what&#8217;s on the map, the conversations about what you&#8217;re doing to improve the numbers on the map, the relationships. That is actually where the value is. So yes, we could get programmatic, deterministic correlation across the metrics, and that would be awesome, but actually it&#8217;s the conversations. So the next one is that our natural reaction is to boil the ocean.</p><p>If I think about an information product like lifetime value, I tend to say to organizations, why don&#8217;t we break it down? We know that to do a lifetime value model we actually need revenue, cost to serve, churn; there&#8217;s a whole set of sub-models, sub-products, that have value. Why don&#8217;t we build the revenue model first and give you a revenue information product? Why don&#8217;t we do cost to serve second, and then over time we&#8217;ll get to that lifetime value. I can imagine with metric trees the natural reaction for some data people is to define every metric.</p><p>So draw the map end to end, define every metric, before we even start doing anything. Whereas I like the ability to change fast. I would be going, maybe sketch out a map really quickly as kind of a blueprint of what we think it looks like, then focus on some metrics, define them, build them, deliver them, monitor them and use that information.</p><p>And then kind of color in the map over time as you learn more. How do you do it?</p><p><strong>Timo</strong>: I had a phase where I was down the rabbit hole, and I did one exercise where I tried to map my whole content production as the most extensive metric tree possible. I still have it somewhere in Miro. It was really a monster, something no one could ever implement. I think in the end it had, I don&#8217;t know, 90 nodes. Obviously that doesn&#8217;t work, and I could easily see it. It was a nice exercise, a nice thing to do over a weekend, but it&#8217;s non-operational; it doesn&#8217;t provide any operational value. And by the way, it would not even be possible to track the whole thing, because it would include a lot of attribution stuff, which is not possible. So, same as what you say: the model I use now is to really keep it simple. Try to work more with sub-trees, and don&#8217;t be too deterministic with it. Give yourself the liberty to say, hey, I&#8217;ll create a specific tree for this new product feature that gets introduced, and I have no idea how it connects to the other tree.</p><p>That&#8217;s fine, I live with that, because local optimization is still better than no optimization, so be nice to yourself. But you definitely have to keep it quite simple, and then you have to know what you can do with it. I know some people use it for root cause analysis, and I think for that it&#8217;s really quite nice, because you can say, okay, look, we lost so much new revenue compared to last month; can we follow the tree down to see where things went off? It can help you get a starting point, but the deep dive is still something different. The metric tree will not tell you exactly where it happened.</p>
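<p><em>A minimal sketch of the root-cause use just mentioned: compare last month with this month for each branch feeding a metric and see which one explains most of the move. The channels and numbers are invented for illustration.</em></p><pre><code># Hypothetical two-month snapshot of the children that feed "new revenue".
# The tree only points you at a starting branch; the deep dive happens elsewhere.
last_month = {"paid_search": 42_000, "organic": 31_000, "partnerships": 12_000}
this_month = {"paid_search": 40_500, "organic": 22_000, "partnerships": 12_500}

total_change = sum(this_month.values()) - sum(last_month.values())
print(f"new revenue moved by {total_change:+,}")

# Rank the branches by how much of that move they account for.
contributions = {
    channel: this_month[channel] - last_month[channel]
    for channel in last_month
}
for channel, delta in sorted(contributions.items(), key=lambda kv: kv[1]):
    share = delta / total_change if total_change else 0.0
    print(f"  {channel:&lt;13} {delta:+8,}  ({share:.0%} of the move)")
# Here "organic" would pop out as the branch to dig into first.
</code></pre>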
<p>Another thing I discovered, which is often overlooked when you talk about metrics and metric trees, is a really big risk: time is completely missing from the thing. The metric tree lives in a timeless space. The problem is, for example, you increase new accounts, as in the example I had before, and then you wonder, okay, why don&#8217;t I see any new revenue? Because it happens in 60, 90 days, or whatever the average time is that people usually take to upgrade into a subscription.</p><p>So the metric tree does a very bad job when things down in the branches are already having an impact, and you only see that impact three months later at the top, with no idea where it&#8217;s coming from. If you did it properly, you would do a cohort analysis where you cohort everything that happens.</p><p>But that doesn&#8217;t work on a tree. The important thing about working with trees is really to know the limitations, what it&#8217;s great at and what it&#8217;s not. As I said, for me it&#8217;s mostly for planning, for brainstorming, for communication. I guess some people also use it for check-in meetings: you have a weekly meeting and you check quickly, okay, how&#8217;s the business performing in specific areas? I think that can work as well. But it&#8217;s not like some people thought, oh, get rid of the dashboards and we just do metric trees. No, that doesn&#8217;t really work. You still need specific, explorative dashboards to figure out why a metric is actually looking a little bit wonky this month.</p><p><strong>Shane</strong>: I think back in the balanced scorecard days there was this concept of simulation. It was almost like a digital twin, although that term wasn&#8217;t around then, from memory. You could actually go in and say, if I change this input metric by a certain amount, what&#8217;s the flow-through?</p><p>And from memory we were inferring a delay. Like you said, the number of prospects coming into the funnel: how long would it be before they create an account and go onto a paid plan? How long would it be before that money turns up if it&#8217;s a 30-day billing window? You baked that into the model to a degree so you could simulate some changes, but that was quite technical.</p><p><strong>Timo</strong>: What I often do is take a metric tree and just break it down in a spreadsheet, and create something we can call a growth model, where I have the metric tree on the left defining all the different rows, and then I put the timeline across the columns, and then I can model exactly what you just described.</p><p>I can do some forecasting. For me the interesting part is that I see the mechanics of the business, and I can change some conversion values and see, okay, what impact does it have when we actually increase this by 10%, and so on, and I see how it plays out downstream. It&#8217;s a very basic and amateur way to do forecast simulation, but often quite enough for most of the stuff.</p><p><strong>Shane</strong>: And again, that relationship&#8217;s an interesting one, and you just made me think about it. We&#8217;re doing a whole lot of work on what we call the context planes. It&#8217;s this idea of taking the context of our data and bringing it back into, for us, a centralized place.</p><p>And we think about four types of context. We think about business context:
actions, outcomes, glossaries, descriptions, those kinds of things. We think about structural context, so physical schemas, data types, everything we know about our data. We think about operational context:</p><p>when was the data loaded, when was it refreshed, what&#8217;s the quality score, those kinds of things. The last one is agent context: the prompting, reinforcement, those kinds of things. And then there&#8217;s a bunch of object types. We have actions as an object type, outcomes as an object type, metrics as an object type, a fact, a value as an object type.</p><p>The last part of it is the relationships across those object types. The reason we&#8217;re working on this context map is that, effectively, if you give it to an LLM, it&#8217;s really good at using it to answer questions. You can do what we call blast radius: if I change this bit of code,</p><p>if I change this object, if I change this metric, what&#8217;s impacted? Thinking about metric trees, they&#8217;re giving us a relationship across those metrics, which is effectively describing the business model for the organization. That relationship would be really valuable to an LLM if you&#8217;re using an AI agent, because you can say, if I touch this metric, here are the relationships, infer what&#8217;s gonna happen. That might actually be a non-deterministic way of getting a simulation quicker than having to build really complex data models.</p><p><strong>Timo</strong>: Yes. I was experimenting a little bit with that. The easiest way is you just take a copy of your metric tree, put it in, and say, okay, can you please run this in specific ways. I also did some experimentation with a descriptive YAML format for metric trees; I was not really happy with that, so for now I have abandoned the idea. But you&#8217;re right, because when you put it into an LLM you can ask, for example, okay, can we run a cohort simulation on top of that? Let&#8217;s assume we run this initiative in the next months where we think we will increase this metric, and then we want to see what the impact will look like over the following months.</p><p>And what I often see, for example when I have a full data setup for a startup, is that you can always use the metric tree in an LLM and say, okay, let&#8217;s brainstorm on this. Right now we have this problem here, or we want to dive deeper into retention.</p><p>This is the whole picture, so can you help us break it down? Can you give us different versions of what the level underneath would look like? At the moment this is still my favorite LLM workflow, or use case: really using it as this massive brainstorming machine where you can say, look, okay, let&#8217;s look at this angle.</p><p>And then I can say, oh, okay, let&#8217;s slightly change this angle, or what if we bring this in? And because you have so many things mapped out in front of you, you can immediately say, oh yeah, this is the direction we should follow.</p>
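<p><em>A toy version of the growth model Timo described earlier, with the metric tree as rows and the months as columns, including the 60-to-90-day style lag he warns about: nudge a conversion rate and the effect only shows up downstream a couple of months later. All volumes, rates and lags are made up.</em></p><pre><code># Toy "metric tree on rows, timeline on columns" forecast.
# Assumed mechanics: signups convert to a paid plan two months later.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
LAG = 2            # hypothetical upgrade delay, in months
ARPA = 40.0        # hypothetical average revenue per new account


def forecast(signups_per_month, signup_to_paid_rate):
    """Return the new-paid and new-MRR rows of the growth model."""
    new_paid, new_mrr = [], []
    for i in range(len(MONTHS)):
        cohort = signups_per_month[i - LAG] if i >= LAG else 0
        paid = cohort * signup_to_paid_rate
        new_paid.append(paid)
        new_mrr.append(paid * ARPA)
    return new_paid, new_mrr


signups = [1000, 1000, 1200, 1200, 1400, 1400]

baseline_paid, baseline_mrr = forecast(signups, signup_to_paid_rate=0.05)
whatif_paid, whatif_mrr = forecast(signups, signup_to_paid_rate=0.055)  # +10%

for month, base, lift in zip(MONTHS, baseline_mrr, whatif_mrr):
    print(f"{month}: baseline new MRR {base:7.0f}  what-if {lift:7.0f}")
# The uplift only shows up from March onwards because of the lag;
# exactly the "timeless tree" trap discussed above.
</code></pre>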
<p>Let&#8217;s go deeper there. So it definitely helps you not do crazy stuff, because you give it a form, a structure.</p><p><strong>Shane</strong>: And then my natural reaction is, oh no, because LLMs are non-deterministic, so the simulations it&#8217;s gonna run are wrong. But we&#8217;ve already said that correlation of cause and effect across the metrics is hard to do, if not impossible. So a non-deterministic engine is just behaving like a human going, these are the patterns we see.</p><p>So actually using it to do those what-if analyses, those simulations, those how-are-these-things-related questions, is just as valuable as humans doing it. I hadn&#8217;t thought about it that way.</p><p><strong>Timo</strong>: And usually nowadays it would take the other approach and write a Python program.</p><p>Often when you ask these things, it says, yeah, let me write a quick script for you. It usually does that, and then you can just check it.</p><p><strong>Shane</strong>: Interesting space. We&#8217;ve seen metric layers come out as BI semantic layers, and from my view they&#8217;re not making it; they&#8217;re not really getting traction in the market. Maybe we&#8217;ll see a reinvigoration of the balanced scorecard products as metric tree products,</p><p>&#8216;cause there&#8217;s probably some value there. Alright, so let&#8217;s just go back to basics. I&#8217;m rocking into an organization, or I&#8217;m an organization, and I wanna start off on this journey. We&#8217;ve talked about the map being the most important thing, the shared conversation, so we need to build that out over time.</p><p>And we&#8217;ve talked about doing it step by step, don&#8217;t boil the ocean. We&#8217;ve talked about the fact that defining metrics is actually hard. Each one of those, you&#8217;re gonna think, oh, it&#8217;s only MRR, how hard can that be? You&#8217;ll find out. So we know that each metric is a lot harder to define and implement than we normally think, and it takes time.</p><p>Where do you start? If we think about this idea of input and output metrics and metrics in the middle, we&#8217;ve got this idea of a heartbeat metric, the core metric for your organization, the one you actually want to look at the most, and that&#8217;s gonna be made up of a whole lot of input metrics, we know that. Where do you start? Do you tend to start at the input side? At the output side? In the middle? How do you decide which metric to do first?</p><p><strong>Timo</strong>: Huh, that&#8217;s a good question. I start in two areas. Area number one: I usually bring a set of people from the company into one room who can explain the whole customer journey, the whole customer flows in this company, to me. I usually work with startups that are between seed and series A,</p><p>so the whole business model, the whole business processes, are not super complex yet. It&#8217;s definitely possible in three hours to map the whole thing out. I do a classic event storming session. Well, not quite classic; event storming is a little bit broader and does more things, so I do a reduced version, because I&#8217;m just interested in specific things. But when you ask a company to explain how the customer basically goes through the whole process, then you will identify these things.
So you will identify, okay, what are the actually important points they have to get to, the ones that mean success for you. We usually find this one place pretty quickly where we say, okay, this is the sweet spot, we definitely have to have a metric for that; we just have to see how we define it. For example, an active user is a North Star metric in general. It&#8217;s a definition process that takes longer, because you come up with a hypothesis where you say, okay, and I&#8217;m just making this up, let&#8217;s say for Miro: they have to create one new dashboard every month, they have to share at least two, and they have to add at least 20 cards to a board, and then you would say they&#8217;re active. That definition has to be fine-tuned over time to see whether it actually sticks, but that comes later. So I start with understanding the whole process. That is part number one. Part number two is something I learned later. I started very early on with these event storming maps because I already used them in other projects for other things; they&#8217;re really great for understanding a process very quickly. The part that was always missing for me, that came later, was sitting down with the leadership team to understand strategy. I really want to understand where they want to go in the next six months, what they want to improve, which direction they want to go, because from there I get the priorities, which is what you just asked.</p><p>Okay, where do we get started? What do we have to build out first? I always try to bring it as close as possible to the strategy the company is executing right now, because when I do, I usually get a lot more interest from the management team, and then from everyone else, because they have to report against it. If I don&#8217;t, people go, yeah, it&#8217;s interesting, but it doesn&#8217;t really count towards our current initiative. So if I fail to support a current strategic push into something, I usually get a really low adoption rate. That is the second, and I would say most essential, part: really trying to understand what they&#8217;re trying to achieve in the next six months.</p><p><strong>Shane</strong>: So once you&#8217;ve blueprinted out, or event stormed, I think of it as a blueprint, you&#8217;re creating a hypothesis of what the metric tree might look like when you&#8217;re finished. Then what you&#8217;re saying is you talk to them about their strategy to figure out where you should zoom in on that event map, that metric map.</p><p>And based on your strategy, I think we should do these ones first, because they seem the most valuable, the ones that&#8217;ll get the most traction or have the largest conversation and adoption internally. Which is the goal: to get more people to understand it, use it and go, yeah, these are valuable,</p><p>let&#8217;s do the rest of them.</p><p><strong>Timo</strong>: It&#8217;s even slightly different. It&#8217;s not just highlighting areas; often there&#8217;s a different version of a metric tree that I use. You can take a metric and you can break it down by specific things. This is the classic thing that you do in BI as well: you have sales, you can break it down by something.</p>
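<p><em>A small sketch of the hypothesis-style active-user definition Timo just made up for a Miro-like product (one new board, two shares, twenty cards in the last 30 days), evaluated from raw events so the rule can be tuned later. The event names and thresholds are hypothetical.</em></p><pre><code>from datetime import date, timedelta

# Hypothetical raw events: (user_id, event_name, event_date)
EVENTS = [
    ("u1", "board_created", date(2026, 1, 20)),
    ("u1", "board_shared",  date(2026, 1, 22)),
    ("u1", "board_shared",  date(2026, 1, 25)),
    ("u1", "card_added",    date(2026, 1, 25)),
    ("u2", "board_created", date(2025, 11, 2)),   # too long ago
]

# The hypothesis: active = 1+ boards created, 2+ shares, 20+ cards in 30 days.
RULE = {"board_created": 1, "board_shared": 2, "card_added": 20}


def is_active(user_id, as_of, events=EVENTS, window_days=30):
    """Apply the (tunable) active-user hypothesis to one account."""
    cutoff = as_of - timedelta(days=window_days)
    counts = {}
    for uid, name, when in events:
        if uid == user_id and when >= cutoff:
            counts[name] = counts.get(name, 0) + 1
    return all(counts.get(name, 0) >= minimum for name, minimum in RULE.items())


print(is_active("u1", as_of=date(2026, 2, 14)))   # False: not enough cards yet
print(is_active("u2", as_of=date(2026, 2, 14)))   # False: activity outside window
</code></pre>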
<p>You can do the same thing with a metric tree. You can take a total metric and break it down by some kind of segmentation, whatever it is, and in that segmentation I can try to incorporate the strategy. Let&#8217;s say, Miro again, I&#8217;m always using Miro because most people have used it before: let&#8217;s say they push into AI-supported workspaces. Then I&#8217;m trying to put up metrics that highlight the adoption of AI-supported workspaces versus non-AI-supported workspaces, so that everyone at first glance can immediately see, are we making progress with that or not? So it&#8217;s really creating a variant of the generic metric tree to say, okay, this is how it would be useful for the current strategy.</p><p><strong>Shane</strong>: Okay, so lemme play that one back, because I want to make sure I get it. It&#8217;s this idea of the by statements on these metrics: I want this metric by channel. Before we started, we talked about the tools we use to create our content, and the fact that I use InCast for recording, Descript for editing and Podbean for publishing.</p><p>And what I&#8217;m seeing now is a convergence. I&#8217;m seeing each of those three products add the other features that I do need into their product, and I was ranting about how that actually degrades the value of the one thing I use them for. So let&#8217;s say we&#8217;re Zencastr and we wanna move into real-time streaming and video editing.</p><p>What I&#8217;d be looking for is the metric of active whatever, maybe active users, maybe active sessions. And then I wanna be asking, okay, how many people are moving from audio-only recording to video recording? Because that&#8217;s an input metric: if they&#8217;re not doing that, they&#8217;re never gonna use our video editing feature.</p><p>And then how many of those move from video recording to video editing, and then segmenting it maybe by region. We say actually the US is the target market, so how many people there? By doing that we&#8217;ve effectively got a really small metric, we&#8217;ve broken it down to its smallest part, but it&#8217;s there to support our strategy of getting more people using our video editing.</p><p>And so that makes sense for me.</p><p><strong>Timo</strong>: That is a perfect example, exactly like that. And usually it makes it a lot easier to talk to people, and people find it very interesting; they want to see the results from it.</p><p><strong>Shane</strong>: And you can get instant feedback, &#8216;cause somebody can say, actually yes, we are going after the video editing market, that&#8217;s one of our initiatives. We all know that most organizations have 52 initiatives every quarter. Or they can say, that&#8217;s not our most valuable initiative, this one over here actually is, and that&#8217;s what we should work on.</p><p>So when you are talking, you&#8217;re bringing a whole lot of patterns to the table to support the metric trees. You&#8217;re talking about event storming. You&#8217;ve talked about jobs to be done.
You&#8217;ve talked about domain-driven design, you&#8217;ve talked about journey mapping.</p><p>There&#8217;s a whole lot of really valuable product and data and agile patterns you&#8217;ve just talked about.</p><p><strong>Timo</strong>: You can see where I&#8217;m coming from. There&#8217;s a big product influence in this whole thing. Yeah.</p><p><strong>Shane</strong>: They&#8217;re all valuable patterns that support what seems to be a simple pattern of metric trees, which is define the metrics and say how they relate.</p><p>Who does the work then? Because if I talk about a data team, often I don&#8217;t see data teams applying product thinking or product patterns; they&#8217;re data teams. You get this idea of a data product manager now who kind of bridges it, but who is the type of person that would do this work, given the level of pattern understanding and experience you actually need to make metric trees work to their best ability, not just defining a metric on a dashboard? Who do you see doing it?</p><p><strong>Timo</strong>: That&#8217;s an excellent question. I don&#8217;t really see data product managers, mainly because I don&#8217;t really see that role happening so much. From my experience, when I talk to people who use metric trees, most of the time they are heads of data leading the data team.</p><p>Why is it a good fit? Because they have a strategic role, and some of these people struggle a little bit to have a strategic role because they love the data stuff. It&#8217;s a tricky process to go from something that was very operational to something that becomes more strategic. But the one or two people I had long talks with were all heading data teams, and they were completely concerned with, okay, how can we do a better job of connecting to what the business is actually doing? They also had a high frustration level, to be fair: they define North Star metrics without us, they do all these things where we as a data team should actually be involved, but there&#8217;s a reason why they don&#8217;t talk to us, so maybe we are not really well prepared for that.</p><p>And so this is where they invest their time. For someone who&#8217;s leading a data team, I think all these things are good things to learn because they have a very strategic aspect to them, and you get the chance to understand the business, which helps a lot.</p><p>If you&#8217;re the head of data, you are the chief sales officer for all the data work. So having a good idea of where your work makes the most impact and gets your team a lot of fame is definitely not a bad thing to learn. So I would place it most likely there, rather than also creating a different new role for it.</p><p><strong>Shane</strong>: And it comes back to the value of the data team, the value of the things they&#8217;re doing. In my head, within the information product canvas we have this area called business questions, the questions we want to answer with that product, and from there I can infer metrics relatively easily.</p><p>So if I think about it as a map, it goes: a business question is supported by a metric, and an action is supported by a business question. If I answer that business question, what action are you gonna take?</p>
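<p><em>A minimal sketch of the blast-radius idea Shane describes: put the object types (code, metrics, business questions, actions) and their &#8220;supports&#8221; relationships into a small graph, then ask what sits downstream of a change. The objects and edges here are hypothetical.</em></p><pre><code>from collections import deque

# Hypothetical context map: "X supports Y" edges between object types.
SUPPORTS = {
    "code:revenue_model":              ["metric:new_mrr"],
    "metric:new_mrr":                  ["question:are_we_growing", "metric:ltv"],
    "metric:ltv":                      ["question:which_segments_to_keep"],
    "question:are_we_growing":         ["action:adjust_marketing_spend"],
    "question:which_segments_to_keep": ["action:change_retention_offers"],
}


def blast_radius(changed_object):
    """Everything reachable downstream of the object you are about to change."""
    impacted, queue = set(), deque([changed_object])
    while queue:
        current = queue.popleft()
        for downstream in SUPPORTS.get(current, []):
            if downstream not in impacted:
                impacted.add(downstream)
                queue.append(downstream)
    return sorted(impacted)


print(blast_radius("metric:new_mrr"))
# ['action:adjust_marketing_spend', 'action:change_retention_offers',
#  'metric:ltv', 'question:are_we_growing', 'question:which_segments_to_keep']
</code></pre>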
<p>Now, when I think about metrics in a metric tree as more of a business model, I&#8217;ve gotta think about where it fits in the model in my head, because now I&#8217;m going, oh, actually, hold on, they&#8217;re not really a subset of a business question, they&#8217;re sitting in this pattern map in a different place. I&#8217;m gonna need to think about that one, &#8216;cause that&#8217;s changed my thinking around metric trees, because</p><p>I started this conversation thinking it&#8217;s just a metric and, yeah, I need a metrics layer, rather than it actually being a simulation or a visualization of a business model.</p><p><strong>Timo</strong>: Yeah. Yeah. A business process. Both, really; it&#8217;s more in this direction.</p><p><strong>Shane</strong>: Yeah. That&#8217;s perfect. All right. So if people want to find you and find your writing, and do you run a course that teaches some or all of this stuff?</p><p><strong>Timo</strong>: Yeah, not yet. So, a guest and I: if you look for metric tree content you will come across a small group of people who have written about it. A guest is one, I&#8217;m another one, and there are some others; Arby has written a lot of stuff about it, Ollie as well.</p><p>So we are working on a course, but we want to do it right, so it takes a bit of time. We&#8217;ve already done some iterations and we have some ideas. In general, if people want to read what I&#8217;m writing, you can go to de odeo.com where I write my newsletter. We sometimes write about metric trees; metrics, product analytics and event data models are usually what I write about. And sometimes I do this on LinkedIn as well, you can also just follow me there. If you have a direct question you can also write to me directly on LinkedIn; I usually try to answer.</p><p><strong>Shane</strong>: Excellent. I read your content, it&#8217;s great. You can tell when somebody knows their craft and then spends the time to figure out how to write something that teaches it. &#8216;Cause I dunno about you, I find I can write content quickly if I don&#8217;t worry about simplicity. But when I try and write something that explains a complex thing with simplicity, it takes me a long time to get it to where I&#8217;m like, yep, I think I&#8217;m happy.</p><p>And then testing it: give it to somebody and have them tell me what it actually means. That is a lot harder. And the way you write, I can read it and go, ah, actually, I think I get it.</p><p><strong>Timo</strong>: That&#8217;s good. I just wrote, I think, the longest piece I ever wrote, an 8,000-word piece about the history and future of digital analytics. That thing was brewing in my head for four months, back and forth with different variations, some sketches on paper and then, ah, I&#8217;m not really happy,</p><p>no, this is the wrong direction. So yeah, these things take a long time.</p><p><strong>Shane</strong>: So on that one, what we&#8217;re seeing now is that generating AI apps is incredibly fast. Thirty, forty years ago we used to struggle with one system sitting on a mainframe or whatever, and getting the data out of it. Then we moved to enterprise resource planning and CRMs,</p><p>so we always ended up with five to ten, and then we moved to software as a service, so we ended up with 15 to a hundred to a thousand systems.
Now, with the ability to spin up an app in a heartbeat, we&#8217;re gonna end up with a problem of a thousand, ten thousand systems that capture data for an organization that allows that to happen. That&#8217;s gonna change the way we do product analytics, isn&#8217;t it?</p><p>Or is it not? Do you think the techniques and patterns and technologies we have today are gonna be able to handle this idea that there are 20,000 one-shot apps in an organization, built for a very small use case and a very small set of personas?</p><p><strong>Timo</strong>: That&#8217;s a very interesting question. I would say in general, no. What really took me some time to get to, and others came to this point as well, is that the tricky thing in product analytics is: where do you find a high abstraction layer that still explains the product well enough, but is simple enough that you can basically do some calculations and good stuff around it?</p><p>What really works well is the user state model, which in the end is a growth model. You say, okay, we have a new user. A new user can become activated, it can become active, it can become at risk when it stops being active, then it can become dormant when basically nothing happens anymore, and then it can be resurrected, and so on.</p><p>So you can basically create a loop where people move through different states. If you take this kind of model, then you can have 400 things under it, as long as you can map the things that are happening to one account. That is always the tricky thing in product analytics. As long as you can do that, you can abstract it at this high level. You can still say, okay, people do different things, or we capture different things that people are doing across, let&#8217;s say, these 100 internal tools we are using; as long as we can bubble them up into one place where we can say, okay, look, when people show these kinds of activities we flag them as an active user, I would say the system still works. I would say that is the only escape. If you go the classic product analytics approach, where you try to track everything and then try to figure something out, obviously that is something you cannot win. You really have to go to a very high abstraction layer. I think the tricky part is really how you do identity stitching. The basic requirement for making marketing and product analytics work is that you have to combine all these different signals, and if you cannot combine all the different signals happening somewhere, then obviously you cannot analyze them for this one account.</p><p>Then you basically have 20 phantom accounts that are actually one, but you have no idea that they are. And this, I think, can be a real issue, especially when these tools can pop up everywhere and the company doesn&#8217;t have a concept to make sure we at least identify people in similar ways everywhere.</p><p><strong>Shane</strong>: Actually, you&#8217;re just giving me a visual map in my head of how you put all this together, potentially. Effectively, what I think you&#8217;re saying is, when we have 10,000 one-shot apps, the first thing we have to do is make sure we&#8217;ve defined the core concepts of our organization.</p><p>So the concept of a user. Yeah. And that&#8217;s important.</p>
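<p><em>A compact sketch of the user state loop Timo outlines (new, activated, active, at risk, dormant, resurrected), derived from how recently an account did something valuable. The thresholds are invented, and in practice this logic would live in the warehouse, as Timo notes next, so it stays easy to change.</em></p><pre><code># Hypothetical thresholds for deriving a user state from event recency.
AT_RISK_AFTER_DAYS = 14
DORMANT_AFTER_DAYS = 45


def user_state(days_since_signup, days_since_last_value_event, was_dormant_before):
    """Place an account in the new/activated/active/at-risk/dormant/resurrected loop."""
    if days_since_last_value_event is None:
        return "new"                 # signed up, never did anything valuable
    if days_since_last_value_event >= DORMANT_AFTER_DAYS:
        return "dormant"
    if days_since_last_value_event >= AT_RISK_AFTER_DAYS:
        return "at_risk"
    if was_dormant_before:
        return "resurrected"         # came back after going quiet
    if days_since_signup <= 7:
        return "activated"           # first valuable action shortly after signup
    return "active"


print(user_state(3, None, False))    # new
print(user_state(5, 1, False))       # activated
print(user_state(120, 2, True))      # resurrected
print(user_state(200, 60, False))    # dormant
</code></pre>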
<p>We know that anything to do with a user is important to our business, so that concept needs to be defined and held. And then if your application is touching a user, touching that concept, you need to be aware of it: you are touching the user concept, and that&#8217;s an important concept for us.</p><p>And then we have the metrics, which are a form of statement around an active user, but also the state of a user, so there&#8217;s a state flow. Your application actually has to be able to do whatever it needs to do to tell us when there&#8217;s a state change, so we can define active or inactive.</p><p><strong>Timo</strong>: The interesting thing is that I usually like to model the state changes myself, so I don&#8217;t want the applications to define them, because we will play around with this. We will have different definitions over time; we might even break it down, we might have six definitions of an active user. This is great stuff you can do in a data model, in a data warehouse.</p><p>I tend not to have it on the application layer already. I just need to track all the things that are happening. This is why events are nice, because I can use events to derive a state change.</p><p><strong>Shane</strong>: But to do that from a governance point of view, when you&#8217;ve got 10,000 apps, you&#8217;re gonna have to say to them: in your app, when a user changes to this state, you need to push it out as an event that I can see.</p><p>That has to be a governed thing: you have to actually emit these events,</p><p>&#8216;cause these are the core events, and if you don&#8217;t do this, your app is not valuable. And how do we know it&#8217;s not valuable? We can show them the metric tree and say, if you&#8217;re not telling us that you&#8217;re changing the state of a user, then this metric won&#8217;t move, and if this metric doesn&#8217;t move, then these other metrics don&#8217;t move.</p><p>And they&#8217;re our core North Star metrics. Internal metrics, sorry; see, I got it wrong there. They&#8217;re our internal metrics, not North Star; North Star metrics are about the customer. So they&#8217;re our internal metrics, but they&#8217;re the important ones. So if you can&#8217;t tell us you&#8217;re improving that metric, then why the hell are you building that app?</p><p>And it&#8217;s a combination of all those patterns, and defining them early, so that all those AI slop applications actually fit into this governance.</p><p><strong>Timo</strong>: I definitely have to make a case for AI slop applications, because privately I love to build them. I&#8217;m a former product person, I still am a product person at heart, and my biggest problem was always that I need to see things so I can make decisions. You can create a wireframe, sure, but it&#8217;s not the same as having an app in front of you. Now you really have the possibility to go down three different ways and say, okay, I would do it like this, and you can compare them and say, yeah, no, this one makes much more sense. So I think it&#8217;s really great what&#8217;s happening, but you&#8217;re completely right,</p><p>it will create an interesting complexity for us.</p><p><strong>Shane</strong>: Oh, we&#8217;ve seen this before, and democratization is brilliant. The ability to put the tools in the hands of people who aren&#8217;t professionals in that craft is massively valuable.
We&#8217;ve seen it time and time again, but we&#8217;ve also seen the impact when we don&#8217;t</p><p>bring in the principles, policies and patterns that are useful. DBT allowed a lot of people to write transformation code, which is great; it removed the bottleneck of centralized data engineers who were never allowed to do it fast enough. But what we ended up with is 5,000 DBT, and I&#8217;m using air quotes here, &#8220;models&#8221;, and we lost the definition of a data model, of a conceptual model, those kinds of things, because we didn&#8217;t apply the policies, patterns and principles that were valuable. So I can just see, with the democratization of building apps, which is great, we are gonna have the same problem. And for me this idea of defining the events, the conceptual model and a metric tree could be the things we use to create those principles, policies and patterns.</p><p><strong>Timo</strong>: That&#8217;s true. I never thought about it in this way, but I think it totally makes sense. You&#8217;re completely right. I&#8217;m a big believer in democratization on one side, but on the other side you need a really good foundational concept, where you define specific metrics that will tell you if the thing still works. For DBT, my classic metric is how long it takes a new member of your team to understand the model and make a production commit. If you have 5,000 models, that&#8217;s going to take ages.</p><p><strong>Shane</strong>: Whereas my metric is: what was the original time from a stakeholder saying they had a problem to being served with something that solved it with data, and now you&#8217;ve moved to a team of 10 new analytics engineers and DBT, has that time come down? &#8216;Cause if it hasn&#8217;t, that&#8217;s just</p><p>busy work. Thank you for hiring more people and being really busy. And the other one I often talk about when I run my course is that the clock starts when the stakeholder says they have a problem, not when that problem hits the Jira queue, prioritized for the data team. Because if that takes three months, the stakeholder has already said it&#8217;s three months late,</p><p>and it&#8217;s not the data team&#8217;s fault, because they&#8217;re not allowed to work on it until it hits the team. But if we think about nodes and links and metric trees as system maps, now we just wanna focus on the prioritization process, because that&#8217;s where it&#8217;s broken. That&#8217;s three months, and if the team takes a day, it&#8217;s still three months and a day as far as the stakeholder is concerned.</p><p>So it&#8217;s the same kind of patterns as you use: some form of event storming, some form of jobs to be done, some form of domain-driven bucketing, some form of journey mapping. We can apply that to the way teams work as well. It&#8217;s the same set of patterns, and they have value. Yeah, end of rant. But I can see that metric tree and the event definitions of those core events being really valuable in the democratization we&#8217;re moving to.</p><p>Excellent. All right, so at the beginning you said, oh, I&#8217;m not sure we can talk about metric trees for an hour, and I said we&#8217;ll talk about lots of stuff, as we did. Hey, look, thank you for coming on the show. It&#8217;s been awesome.
I&#8217;ve learned lots and I hope everybody has a simply magical day.</p><h2>&#171;oo&#187;</h2><div class="pullquote"><p><em>Stakeholder - &#8220;Thats not what I wanted!&#8221; <br>Data Team - &#8220;But thats what you asked for!&#8221;</em></p></div><p>Struggling to gather data requirements and constantly hearing the conversation above?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Bu2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" width="387" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:387,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19725,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/160520537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Want to learn how to capture data and information requirements in a repeatable way so stakeholders love them and data teams can build from them, by using the Information Product Canvas.</p><p>Have I got the book for you!</p><p>Start your journey to a new Agile Data Way of Working.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://adiwow.com/168&quot;,&quot;text&quot;:&quot;Buy the Agile Data Guide now!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://adiwow.com/168"><span>Buy the Agile Data Guide now!</span></a></p><h2>&#171;oo&#187;</h2>]]></content:encoded></item><item><title><![CDATA[Diagramming to Understanding Your Data Estate with Rob Long]]></title><description><![CDATA[AgileData Podcast #76]]></description><link>https://agiledata.info/p/diagramming-to-understanding-your</link><guid isPermaLink="false">https://agiledata.info/p/diagramming-to-understanding-your</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Mon, 13 Oct 2025 20:09:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/-vLJDI_tBa0" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Rob Long about diagramming patterns that you can use to quickly document and share your data estate.</p><blockquote><p><strong><a href="https://agiledata.substack.com/i/176072946/listen">Listen</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/176072946/google-notebooklm-mindmap">View MindMap</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/176072946/google-notebooklm-briefing">Read AI Summary</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/176072946/transcript">Read Transcript</a></strong></p></blockquote><p></p><h2>Listen</h2><p>Listen on all good 
podcast hosts or over at:</p><p><a href="https://podcast.agiledata.io/e/diagramming-to-understanding-your-data-estate-with-rob-long-episode-76/">https://podcast.agiledata.io/e/diagramming-to-understanding-your-data-estate-with-rob-long-episode-76/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://podcast.agiledata.io/e/diagramming-to-%E2%80%8Aunderstanding-your-data-estate-with-rob-long-episode-76/&quot;,&quot;text&quot;:&quot;Listen to the Agile Data Podcast Episode&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://podcast.agiledata.io/e/diagramming-to-%E2%80%8Aunderstanding-your-data-estate-with-rob-long-episode-76/"><span>Listen to the Agile Data Podcast Episode</span></a></p><p></p><blockquote><p><strong>Subscribe:</strong> <a href="https://podcasts.apple.com/nz/podcast/agiledata/id1456820781">Apple Podcast</a> | <a href="https://open.spotify.com/show/4wiQWj055HchKMxmYSKRIj">Spotify</a> | <a href="https://www.google.com/podcasts?feed=aHR0cHM6Ly9wb2RjYXN0LmFnaWxlZGF0YS5pby9mZWVkLnhtbA%3D%3D">Google Podcast </a>| <a href="https://music.amazon.com/podcasts/add0fc3f-ee5c-4227-bd28-35144d1bd9a6">Amazon Audible</a> | <a href="https://tunein.com/podcasts/Technology-Podcasts/AgileBI-p1214546/">TuneIn</a> | <a href="https://iheart.com/podcast/96630976">iHeartRadio</a> | <a href="https://player.fm/series/3347067">PlayerFM</a> | <a href="https://www.listennotes.com/podcasts/agiledata-agiledata-8ADKjli_fGx/">Listen Notes</a> | <a href="https://www.podchaser.com/podcasts/agiledata-822089">Podchaser</a> | <a href="https://www.deezer.com/en/show/5294327">Deezer</a> | <a href="https://podcastaddict.com/podcast/agiledata/4554760">Podcast Addict</a> |</p></blockquote><p></p><div id="youtube2--vLJDI_tBa0" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;-vLJDI_tBa0&quot;,&quot;startTime&quot;:&quot;2s&quot;,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/-vLJDI_tBa0?start=2s&amp;rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>You can get in touch with Rob via <a href="https://www.linkedin.com/in/robert-s-long">LinkedIn</a> or over at:</p><div class="embedded-publication-wrap" data-attrs="{&quot;id&quot;:4151733,&quot;name&quot;:&quot;AtLongLastAnalytics&quot;,&quot;logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!Y2Sn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff873296-b1ce-48f9-9ad6-7a57cc53a452_500x500.png&quot;,&quot;base_url&quot;:&quot;https://atlonglastanalytics.substack.com&quot;,&quot;hero_text&quot;:&quot;A one-stop shop for all things data engineering, analytics, and strategy.&quot;,&quot;author_name&quot;:&quot;AtLongLast Analytics&quot;,&quot;show_subscribe&quot;:true,&quot;logo_bg_color&quot;:&quot;#ffffff&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPublicationToDOMWithSubscribe"><div class="embedded-publication show-subscribe"><a class="embedded-publication-link-part" native="true" href="https://atlonglastanalytics.substack.com?utm_source=substack&amp;utm_campaign=publication_embed&amp;utm_medium=web"><img class="embedded-publication-logo" 
src="https://substackcdn.com/image/fetch/$s_!Y2Sn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff873296-b1ce-48f9-9ad6-7a57cc53a452_500x500.png" width="56" height="56" style="background-color: rgb(255, 255, 255);"><span class="embedded-publication-name">AtLongLastAnalytics</span><div class="embedded-publication-hero-text">A one-stop shop for all things data engineering, analytics, and strategy.</div><div class="embedded-publication-author-name">By AtLongLast Analytics</div></a><form class="embedded-publication-subscribe" method="GET" action="https://atlonglastanalytics.substack.com/subscribe?"><input type="hidden" name="source" value="publication-embed"><input type="hidden" name="autoSubmit" value="true"><input type="email" class="email-input" name="email" placeholder="Type your email..."><input type="submit" class="button primary" value="Subscribe"></form></div></div><p>The article we talk about in the podcast episode:</p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:164669741,&quot;url&quot;:&quot;https://atlonglastanalytics.substack.com/p/data-producer-consumer-diagrams&quot;,&quot;publication_id&quot;:4151733,&quot;publication_name&quot;:&quot;AtLongLastAnalytics&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!Y2Sn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff873296-b1ce-48f9-9ad6-7a57cc53a452_500x500.png&quot;,&quot;title&quot;:&quot;Data Producer-Consumer Diagrams: Understanding Your Data Estate&quot;,&quot;truncated_body_text&quot;:&quot;Read time: 8 minutes&quot;,&quot;date&quot;:&quot;2025-05-30T16:31:01.862Z&quot;,&quot;like_count&quot;:1,&quot;comment_count&quot;:0,&quot;bylines&quot;:[{&quot;id&quot;:319352722,&quot;name&quot;:&quot;AtLongLast Analytics&quot;,&quot;handle&quot;:&quot;atlonglastanalytics&quot;,&quot;previous_name&quot;:&quot;Rob Long&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a3ab027-fa30-47db-8616-0ba10193864f_460x460.png&quot;,&quot;bio&quot;:&quot;Independent consultant spanning data engineering and analytics, data strategy, and Microsoft Azure. Trying to make data easier to understand and more accessible! 
&quot;,&quot;profile_set_up_at&quot;:&quot;2025-02-18T18:41:22.316Z&quot;,&quot;reader_installed_at&quot;:&quot;2025-02-21T04:15:39.693Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:4234141,&quot;user_id&quot;:319352722,&quot;publication_id&quot;:4151733,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:true,&quot;publication&quot;:{&quot;id&quot;:4151733,&quot;name&quot;:&quot;AtLongLastAnalytics&quot;,&quot;subdomain&quot;:&quot;atlonglastanalytics&quot;,&quot;custom_domain&quot;:null,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;A one-stop shop for all things data engineering, analytics, and strategy.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff873296-b1ce-48f9-9ad6-7a57cc53a452_500x500.png&quot;,&quot;author_id&quot;:319352722,&quot;primary_user_id&quot;:319352722,&quot;theme_var_background_pop&quot;:&quot;#FF6719&quot;,&quot;created_at&quot;:&quot;2025-02-18T18:43:38.667Z&quot;,&quot;email_from_name&quot;:&quot;AtLongLast Analytics Newsletter&quot;,&quot;copyright&quot;:&quot;AtLongLast Analytics LLC&quot;,&quot;founding_plan_name&quot;:&quot;Founding Member&quot;,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;newspaper&quot;,&quot;is_personal_mode&quot;:false}}],&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null,&quot;status&quot;:{&quot;bestsellerTier&quot;:null,&quot;subscriberTier&quot;:null,&quot;leaderboard&quot;:null,&quot;vip&quot;:false,&quot;badge&quot;:null,&quot;paidPublicationIds&quot;:[]}}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://atlonglastanalytics.substack.com/p/data-producer-consumer-diagrams?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!Y2Sn!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff873296-b1ce-48f9-9ad6-7a57cc53a452_500x500.png" loading="lazy"><span class="embedded-post-publication-name">AtLongLastAnalytics</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">Data Producer-Consumer Diagrams: Understanding Your Data Estate</div></div><div class="embedded-post-body">Read time: 8 minutes&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">10 months ago &#183; 1 like &#183; AtLongLast Analytics</div></a></div><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? 
The Information Product Canvas helps you get clarity in 30 minutes or less.</strong></p><p class="button-wrapper" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><h2>Google NotebookLM Mindmap</h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Om6c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdca1c6d-5d66-4af7-a80a-5ad3f7e25e2a_5918x14936.png"><img src="https://substackcdn.com/image/fetch/$s_!Om6c!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdca1c6d-5d66-4af7-a80a-5ad3f7e25e2a_5918x14936.png" class="sizing-large" alt="Google NotebookLM mindmap of the episode" loading="lazy"></a></figure></div><h2>Google NotebookLM Briefing</h2><h2>Executive Summary</h2><p>This document synthesizes key insights from a discussion with data strategy consultant Rob Long regarding the use of <strong>Data Producer-Consumer Diagrams</strong> as a powerful framework for understanding, managing, and communicating the complexities of a data estate. The core concept involves mapping the flow of data through a series of <strong>Producers</strong> (entities that generate/capture data), <strong>Consumers</strong> (entities that use data for a purpose), and <strong>Handovers</strong> (the interfaces between them).</p><p>The primary value of this approach lies in its ability to create a unified, workflow-driven view that integrates disparate organizational artifacts like organizational charts, technology architecture diagrams, and data lineage maps. These diagrams are not meant to replace existing technical documentation but to complement it by providing a more accessible, story-driven perspective. They can be tailored in granularity&#8212;from high-level conceptual models for executives to detailed technical maps for engineers&#8212;allowing for targeted communication.</p><p>Key applications include clarifying team roles and responsibilities, performing impact analysis for technology changes, defining the scope of data contracts beyond mere technical schemas, and identifying process inefficiencies. 
By visually representing the interplay of people, processes, technology, and data, these diagrams provide the necessary scaffolding for informed, strategic conversations and decision-making, ultimately reducing organizational friction and aligning data initiatives with business objectives.</p><p>--------------------------------------------------------------------------------</p><h2>1. The Producer-Consumer Framework</h2><p>The foundation of this methodology rests on a simple yet powerful set of definitions derived from systems thinking, providing a common language to describe any data workflow.</p><h3>Core Concepts</h3><ul><li><p><strong>Data Producer:</strong> An individual, team, or system responsible for generating or capturing data. Examples range from a physical sensor reading temperature to a CRM system capturing customer interactions.</p></li><li><p><strong>Data Consumer:</strong> An entity that receives and utilizes data for a specific purpose, such as answering business questions, training a model, or generating a report.</p></li><li><p><strong>Handover (or Link):</strong> The critical interface where a Producer transfers data to a Consumer. This encompasses the mechanisms and agreements governing the exchange, including data contracts, quality checks, compliance rules, and delivery formats.</p></li></ul><p>This terminology is analogous to the systems thinking concepts of &#8220;Nodes&#8221; (where a job is done) and &#8220;Links&#8221; (the handover between nodes).</p><h3>Fundamental Value Proposition</h3><p>The central benefit of this framework is its ability to unify disparate views of a data estate into a single, coherent narrative. As described by Rob Long, &#8220;it helps give you a unified view of your data estate... together they give you that workflow driven kind of diagrams which help unify everyone and reduce organizational friction.&#8221;</p><ul><li><p><strong>Complements Existing Artifacts:</strong> It enriches traditional documents like vertical organizational charts by providing a horizontal, workflow-oriented perspective. It integrates views of people (org charts), technology (architecture diagrams), and data movement (lineage diagrams).</p></li><li><p><strong>Workflow-Driven Perspective:</strong> The diagrams focus on the end-to-end flow of work and data, clarifying how value is created and transferred across teams and systems.</p></li><li><p><strong>Establishes a Common Ground:</strong> By using a simple, intuitive model, it allows technical and non-technical stakeholders to engage in meaningful conversations about complex data processes.</p></li></ul><h2>2. Key Applications and Use Cases</h2><p>The producer-consumer diagramming approach is a versatile tool with a wide range of practical applications for data strategy, architecture, and team management.</p><h3>Mapping a Multi-Layered View</h3><p>The framework is capable of mapping multiple organizational dimensions onto a single diagram, providing a rich, contextualized picture. 
This includes mapping:</p><ul><li><p>Data systems and their interactions.</p></li><li><p>The flow of data through different architectural layers (e.g., Bronze, Silver, Gold in a Medallion Architecture).</p></li><li><p>Team design and the boundaries of responsibility.</p></li><li><p>The specific technologies and tools used at each stage.</p></li><li><p>The personas (e.g., Data Engineer, Business Analyst) involved in the workflow.</p></li></ul><h3>Variable Granularity for Diverse Audiences</h3><p>A key strength is the ability to adjust the level of detail to suit the audience and the story being told.</p><ul><li><p><strong>High-Level (Executive View):</strong> A simple map with a few nodes (e.g., &#8220;Source System,&#8221; &#8220;Data Lake,&#8221; &#8220;Data Warehouse,&#8221; &#8220;Reporting&#8221;) can tell a clear, conceptual story without overwhelming detail.</p></li><li><p><strong>Fine-Grained (Technical View):</strong> The same map can be expanded to show intricate details within each node, such as specific data quality rules, data mastering processes, metric definitions, and the technologies involved.</p></li></ul><p>This flexibility allows for the creation of &#8220;variants very simply which tell different stories for the audience.&#8221;</p><h3>Enhancing Handovers and Data Contracts</h3><p>The framework places significant emphasis on the &#8220;handover,&#8221; treating it as a critical point for negotiation and clarity.</p><ul><li><p><strong>Identifying Waste:</strong> It helps uncover process inefficiencies, such as when a producer generates data that is never used or when a consumer needs information that is not provided, forcing them to perform redundant work.</p></li><li><p><strong>Informing Data Contracts:</strong> The model is crucial for defining what needs to go into a data contract. It pushes the concept beyond technical specifications (schema, field types) to what it truly should be: &#8220;an agreement between two parties... a negotiation between the producer and the consumer about what&#8217;s needed for both sides to do their job well.&#8221; This includes non-technical aspects like documentation levels and support expectations.</p></li></ul><h3>Facilitating Feedback and Agile Processes</h3><p>The diagrams inherently support modern development practices by visualizing necessary communication channels.</p><ul><li><p><strong>Feedback Loops:</strong> In reality, data flows are not purely unidirectional. The model highlights the importance of feedback loops from consumers back to producers to report errors, document issues, or request changes. This moves teams away from &#8220;throwing things over the fence&#8221; and towards collaborative problem-solving.</p></li><li><p><strong>Agile Gates:</strong> It provides a structure for implementing agile patterns like &#8220;Definition of Ready&#8221; (what a consumer needs before starting work) and &#8220;Definition of Done&#8221; (what a producer must complete before handing off work).</p></li></ul><h2>3. 
A Practical Example: Mapping a Data Workflow</h2><p>A concrete example illustrates how these concepts are applied to map an entire data pipeline, layering multiple dimensions of information.</p><h3>Scenario Overview</h3><p>A common data workflow can be visualized with five primary nodes:</p><ol><li><p><strong>Source System</strong></p></li><li><p><strong>Data Lake</strong></p></li><li><p><strong>Data Warehouse</strong></p></li><li><p><strong>Analysis/BI Layer</strong></p></li><li><p><strong>Report</strong></p></li></ol><h3>Layering Information</h3><p>This basic flow can be enriched with additional layers of context to tell a more complete story.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FOs0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0692f4bb-f6fe-4aae-b261-9bc230a1a438_715x411.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FOs0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0692f4bb-f6fe-4aae-b261-9bc230a1a438_715x411.png 424w, https://substackcdn.com/image/fetch/$s_!FOs0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0692f4bb-f6fe-4aae-b261-9bc230a1a438_715x411.png 848w, https://substackcdn.com/image/fetch/$s_!FOs0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0692f4bb-f6fe-4aae-b261-9bc230a1a438_715x411.png 1272w, https://substackcdn.com/image/fetch/$s_!FOs0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0692f4bb-f6fe-4aae-b261-9bc230a1a438_715x411.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FOs0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0692f4bb-f6fe-4aae-b261-9bc230a1a438_715x411.png" width="715" height="411" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0692f4bb-f6fe-4aae-b261-9bc230a1a438_715x411.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:411,&quot;width&quot;:715,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55892,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/176072946?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0692f4bb-f6fe-4aae-b261-9bc230a1a438_715x411.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FOs0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0692f4bb-f6fe-4aae-b261-9bc230a1a438_715x411.png 424w, https://substackcdn.com/image/fetch/$s_!FOs0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0692f4bb-f6fe-4aae-b261-9bc230a1a438_715x411.png 848w, 
https://substackcdn.com/image/fetch/$s_!FOs0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0692f4bb-f6fe-4aae-b261-9bc230a1a438_715x411.png 1272w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Derived Insights and Strategic Questions</h3><p>This single, relatively simple diagram immediately provides significant value:</p><ul><li><p><strong>Quick Comprehension:</strong> It tells an instant story. An observer can see, &#8220;you&#8217;re a Microsoft stack... I&#8217;m not seeing Snowflake, I&#8217;m not seeing Databricks.&#8221;</p></li><li><p><strong>Generates Insightful Questions:</strong> The map acts as a catalyst for deeper inquiry. For instance:</p><ul><li><p>Are the Data Analysts read-only in the Data Warehouse, or can they write transformations?</p></li><li><p>Where are business metrics defined? Are they only in the Power BI semantic layer?</p></li><li><p>Is there a mismatch between the GUI-based tools (Power BI, Excel) and the skills of a potential hire who is a &#8220;hardcore Python coder&#8221;?</p></li></ul></li><li><p><strong>Identifies Boundaries:</strong> The diagram visually delineates responsibilities. It becomes clear where ownership lies within a single team (e.g., Analyst BI to Report) and where cross-team handovers occur (e.g., Data Engineering to Data Analytics), highlighting areas that require formal processes for communication and problem-tracking.</p></li></ul>
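<p><em>The example above is concrete enough to capture as data rather than only as a drawing. The Python below is a minimal, illustrative sketch (not code from the episode): it models the five nodes, the technologies and personas attached to them in the example, and the handovers between them, then walks the handover list to show the downstream impact of changing one node.</em></p><pre><code># Sketch of the five-node example as a producer-consumer graph.
# Node names, technologies and personas come from the example above;
# everything else is illustrative.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    technology: str
    persona: str

NODES = {
    "Source System": Node("Source System", "External system", ""),  # persona not given in the example
    "Data Lake": Node("Data Lake", "Azure Data Lake", "Data Engineer"),
    "Data Warehouse": Node("Data Warehouse", "Synapse Analytics", "Data Engineer"),
    "Analysis/BI": Node("Analysis/BI", "Power BI", "Data Analyst"),
    "Report": Node("Report", "PowerPoint / Excel", "Data Analyst"),
}

# Handovers (the links), producer on the left, consumer on the right.
HANDOVERS = [
    ("Source System", "Data Lake"),   # extracted and loaded by Azure Data Factory
    ("Data Lake", "Data Warehouse"),
    ("Data Warehouse", "Analysis/BI"),
    ("Analysis/BI", "Report"),
]

def downstream(node_name):
    """Every node that consumes, directly or indirectly, from node_name."""
    impacted, frontier = set(), [node_name]
    while frontier:
        current = frontier.pop()
        for producer, consumer in HANDOVERS:
            if producer == current and consumer not in impacted:
                impacted.add(consumer)
                frontier.append(consumer)
    return impacted

if __name__ == "__main__":
    # The downstream impact of swapping out the Analysis/BI node.
    for name in sorted(downstream("Analysis/BI")):
        node = NODES[name]
        print(f"{node.name}: {node.persona} on {node.technology}")
</code></pre><p><em>Walking the handovers like this answers the same question Shane raises later in the conversation: if we swap out the BI tool, what else on the map are we really replacing?</em></p><h2>4. 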
Strategic Benefits and Implementation</h2><p>Beyond tactical mapping, the producer-consumer framework is a strategic tool for driving change, fostering alignment, and building robust data capabilities.</p><h3>Driving Informed Decisions</h3><p>The diagrams provide &#8220;the scaffolding for meaningful conversation and decision-making.&#8221;</p><ul><li><p><strong>Impact Analysis:</strong> They make the &#8220;blast impact&#8221; of a proposed change instantly visible. For example, replacing a BI tool is not just a single change; the map shows it could necessitate replacing &#8220;one-third of our data layered architecture.&#8221;</p></li><li><p><strong>Challenging Assumptions:</strong> By providing a concrete reference point, the diagrams allow stakeholders to challenge decisions with an informed opinion, moving conversations away from pure conjecture.</p></li></ul><h3>Collaborative Creation and Alignment</h3><p>The process of creating these diagrams is as valuable as the final artifact.</p><ul><li><p><strong>Workshop Approach:</strong> A highly effective method involves collaborative workshops where a team jointly maps its processes. This is a quick way to document workflows and often reveals stark differences between how leaders believe work is done and how it actually is.</p></li><li><p><strong>Synthesizing Perspectives:</strong> Another successful technique involves creating separate diagrams with different groups (e.g., technical contributors, managers, executives) and then bringing them together. This process uncovers misalignments in definitions (e.g., what constitutes a &#8220;data set&#8221;) and processes, leading to the creation of a new, shared &#8220;truth&#8221; that becomes the adopted standard.</p></li></ul><h3>A Tool for Strategy, Not Just Technology</h3><p>The framework encourages a holistic view of data strategy, aligning with the &#8220;four pillars&#8221; of <strong>People, Process, Technology, and Data</strong>. It forces a focus on foundational questions before technology selection:</p><ul><li><p>What does success look like and how will it be measured?</p></li><li><p>What team design and ways of working are needed to achieve the goal?</p></li></ul><p>The architecture then becomes &#8220;a means to the end... how you achieve the goal, it&#8217;s not the goal.&#8221; This helps in designing a &#8220;minimal system to achieve the goal and to hit success&#8221; rather than an over-engineered solution.</p><h2>5. 
The Underutilization of Systems Thinking in Data</h2><p>Despite the long history of systems thinking in fields like lean manufacturing, its application in the data domain remains rare.</p><h3>The Communication Gap</h3><p>The reluctance to adopt these end-to-end mapping techniques may stem from several factors:</p><ul><li><p><strong>Hyper-Specialization:</strong> Data roles are often highly specialized (e.g., Scala developer, dbt modeler), with practitioners not always being taught to consider the entire system.</p></li><li><p><strong>Lack of End-to-End Onboarding:</strong> Unlike factory workers who &#8220;walk the line&#8221; to understand the full process, data professionals are often not onboarded with a holistic view of the data flow.</p></li><li><p><strong>Communication Skills:</strong> There is a recognized gap in &#8220;soft skills,&#8221; particularly the ability to communicate complex technical ideas to diverse, multi-disciplinary audiences.</p></li></ul><h3>Bridging the Technical and Business Worlds</h3><p>Producer-consumer diagrams are a vital tool for bridging this gap.</p><ul><li><p><strong>Making Complexity Accessible:</strong> They distill complex ideas into simple, understandable stories. As Rob Long notes, a key skill is to &#8220;make information accessible, whether that&#8217;s just choosing the right type of diagram, using the right vocabulary.&#8221;</p></li><li><p><strong>Educating Stakeholders:</strong> The diagrams can be used to explain technical concepts like data lineage at a conceptual level to executives. This isn&#8217;t to make them experts, but to build awareness so they can better understand the value and necessity of investments in areas like data contracts and observability.</p></li><li><p><strong>Fostering Business Acumen:</strong> The framework encourages data professionals to become more business-driven by understanding how their technical work fits into the broader value stream, aligning with the principle that &#8220;as companies want to become more data driven, data engineers should want to become more business driven.&#8221;</p></li></ul><p></p><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><p></p><h2>Transcript</h2><p><strong>Shane</strong>: Welcome to the Agile Data Podcast. I&#8217;m Shane Gibson.</p><p><strong>Rob</strong>: I&#8217;m Robert Long.</p><p><strong>Shane</strong>: Hey, Rob. Thanks for coming on the podcast. Today I wanted to have a chat to you about this article you wrote that I really loved. It was called data Producers Consumer Diagrams, and Understanding Your Data Estate. But before we rip into that, why don&#8217;t you give the audience a bit of background about yourself.</p><p><strong>Rob</strong>: Yeah, I&#8217;d love to. So I got started with data in academia. So I actually have a PhD in applied mathematics and geophysics. So I was doing mathematical modeling and numerical simulation of planets and stars. Things like sun spots and storms on Jupiter. 
And without realizing at the time what I was doing was bringing machine learning and data engineering to that field. &#8216; cause these simulations generate big data and we were implementing. Supervised machine learning, but it was still a step up for that discipline group of people. Since then, I&#8217;ve dabbled all over the place, so I had a couple of years where I run a software as a service, helping schools in England improve their national ranking. So again, very data centric. And then throughout my career I&#8217;ve gone through and I spent a couple of years as a senior data engineer in defense in the uk, so really helping &#8216;em upskill and using cloud platforms, how they do data, and was using GN AI to help with content creation, so scripts, audio imagery, things like that. And now I&#8217;ve moved from the UK to the US at the start of this year, and I run a consultancy where I focus mainly on enabling analytics. So lots of data engineering work and data strategy. And that later part is really where this kind of producer, consumer framework came from as I&#8217;ve been helping. Companies new to data Build the foundation. So really getting them set up on the right path. And that&#8217;s how this kind of came to be.</p><p><strong>Shane</strong>: Let&#8217;s do a little bit of anchoring around terms, &#8216;cause you use slightly different terms than I do, so we&#8217;ll anchor those. So just talk me through this idea of how you define a data producer, a data consumer, and the concept of a handover.</p><p><strong>Rob</strong>: So I think a producer, a very broad strokes, is an individual, a team or system that produces, which means to generate or capture data. So that could be thermometer, reading temperature, it could be a CRM system, whatever it is. A data consumer is someone who takes in that data for a purpose. And so there you&#8217;re thinking they need data to look a certain way so they can use it to answer certain questions. And then a handover is really this interface with the two meet. So that&#8217;s where a producer is literally handing over, in this case, data to a consumer. And so there you might, think about terms like data contracts you might have quality checks, compliance, whatever it is. And so I build these networks on those three ideas really.</p><p><strong>Shane</strong>: So the language I tend to use is I was a great fan of a TEDx video called How to Make Toast. And it&#8217;s around system thinking, so very much the same. And he talks about nodes and links. So for me, I talk about nodes and links. Node is where a job is done, something happens, and then the link is the handover to another node.</p><p>Data producer in terms of data&#8217;s created in the CRM or a sensor is creating an event that is a node and when it moves to somewhere else to get worked on or consumed, use a link to the next node. I&#8217;ll jump between your terminology and my terminology all the way through as I always do. So we&#8217;ve got this idea of producers, consumers, and handovers or nodes and links . How do we use them? ? Why do we care? ? Where&#8217;s the value?</p><p><strong>Rob</strong>: I think the value is really in unifying other things that already exist. So lots of companies, for instance, might have an organization chart, which gives you an almost vertical view of the roles or teams and where they sit. But as we know, lots of data work happens horizontally. 
But there again, it&#8217;s not to replace org charts, but it helps enrich them because these producer consumer diagrams are very workflow driven. We&#8217;re trying to capture a flow from start to end. so that&#8217;s one use case where it lets you overlay these different artifacts you already have. As I said, an org chart is typically representing people. You might have your architecture diagrams, which represent technology. You might have data lineage or flow diagrams, which capture how data&#8217;s moving through these systems and. All of them, to some degree, you can represent as producers and consumers &#8216; cause they&#8217;re either generating or using data. And so I think this is really powerful &#8216;cause it helps give you a unified view of your data estate. All of these artifacts are really good, but together they give you that workflow driven kind of diagrams which help unify everyone and reduce organizational friction.</p><p><strong>Shane</strong>: I agree. I think the idea of using a map or a diagram to visualize a concept is really valuable as long as we don&#8217;t overload it . We don&#8217;t make it too complex in itself. So those maps should be as simple as possible. And so we can use it to map systems. We can, like you said, we can use it to map all charts.</p><p>We can use it to map flow of data. We can use it to map data layers and the rules on each of those layers. We can use it to map team design. We can use it to map a flow of work as you say, somebody works on something and then hands it off to somebody else who uses that. I think, if we go back to some of the original data mesh thinking there was this idea of input and output ports.</p><p>So we&#8217;re talking about that, somebody does a bit of work and it gets handed over to somebody else who does another bit of work, and we just visualizing that workflow or that flow. Is that how you see it?</p><p><strong>Rob</strong>: That&#8217;s exactly it.</p><p><strong>Shane</strong>: Let&#8217;s give a an example, concrete example. Talk me through, an example of one of these maps.</p><p><strong>Rob</strong>: Going through an example is really useful. &#8216;cause there&#8217;s also this concept of you can look at different granularities and I showed diagrams of that in the article where at a very high level you might just have your external system, which is producing data, and then you bring it into internally into your architecture where you have your own processes for data.</p><p>So there we&#8217;ve got the external being the producer, the internal being the consumer. But if we look more deeply internally, we might have a data preparation layer where we ingest, do quality, create our data models, and then the reporting function, which takes that middle block, which is now consuming data and producing data. So you get this cool thing of, as you add nodes, things can turn from just being a producer to a producer, consumer. And so I think there, that&#8217;s an example where these networks are dynamic, but they also can have as much information as you want &#8216;em to have. So to your point, at a very high level, you might want a very few number of nodes and keep it almost like a conceptual model. And then as you go down, you might want to get into the real nitty gritty, detailed view to understand exactly for instance, which technologies are doing what, which teams are responsible where, and things like that.</p><p><strong>Shane</strong>: I think the level of detail can vary. 
So for example, if we think about layered data architectures, which have been around for donkeys, but are now hotten and popular again because of the medallion architecture, we can have a very high level map of nodes and links where there&#8217;s five, source system, whether it&#8217;s produced, data&#8217;s generated, bronze , silver, gold and then the consumer.</p><p>And that&#8217;s only five nodes, and it tells a very simple story. Now, what we know as data people is there&#8217;s a whole lot of extra complexity. Where&#8217;s the data quality rules been applied as the data contracts across each of those nodes. There&#8217;s nodes within that. Whenever we get into silver, how are we conforming data?</p><p>How are we mastering data, how we bring in and reference data a whole lot of things, where are we defining those metrics? So we can take that very simple diagram and then with only the five nodes, and then we can break it down to even more detail when it makes sense, so we can present the same system with two different stories depending on who our audience is and the story that we want to tell.</p><p><strong>Rob</strong>: I think that&#8217;s really powerful. That&#8217;s exactly right, is as you said, you might want the data engineering team to understand what goes on within the silver layer, but your business analysts who are consuming at the end, they. Don&#8217;t necessarily need to know care or even have access to the data that&#8217;s there. And so that&#8217;s exactly right. It&#8217;s powerful because you can create variance very simply, which tell different stories for the audience and it lets you again, unify all those other bits as needed.</p><p><strong>Shane</strong>: One of the other ways we can use it is this idea of when there&#8217;s a handover or when there&#8217;s a link, what&#8217;s actually involved. So what is handed over? Is it just a blob of data? Is that a blob of data with a schema? Is it a document? What does the person hand over to the next person?</p><p>What does the producer hand over? And what does the consumer expect? What do they need to be handed over to them to do their step in that process? And a lot of people don&#8217;t think about that. We don&#8217;t bring in that lean system thinking to say, actually, where&#8217;s waste in terms of we, the producer generates some stuff that&#8217;s never used, but also waste because the consumer needs some information that&#8217;s never provided and now they&#8217;ve gotta go and do all that work again and introduce that wasted effort that&#8217;s already been done because nobody told them that it has been done or what it was.</p><p><strong>Rob</strong>: It plays into a bigger story of kind of, in a very ideal case, if everything&#8217;s perfect, these are very unidirectional networks. You take data from source process it, it&#8217;s perfect. So it&#8217;s consumed. In reality, it&#8217;s more complex because you want to empower consumers at each handover to be able to report errors and feedback and say, you&#8217;ve given me this with a bunch of columns, that I dunno what they are &#8216; cause they&#8217;re undocumented or they&#8217;re new in the API or for whatever reason. And so it&#8217;s really important to have both the producer driven flow, which is in my mind, left to right from source to target, but to have the feedback loops that go in the other direction. 
And so then you can have both the functional benefits of fixing errors quickly, knowing that they exist, what have you, but also from a very almost people level, remove some of the idea of playing the blame game because you&#8217;re no longer throwing things over the fence. It&#8217;s about having well-defined processes so that everyone can work together to fix these problems, and I think that&#8217;s really important.</p><p><strong>Shane</strong>: Yeah, and if we look at it again from that system thinking, we can have different lenses on what those processes are. So for example, if we look at the flow of data, we can look at data contracts. And we can look at data contracts more than just system of capture to the people doing the data wrangling.</p><p>When we get it into that rule layer, we can actually have data contracts between every node and link. So if I&#8217;m transforming some data and creating a metric, what&#8217;s the contract for that metric? Is there a standard YAML format? Is there a certain descriptive metadata I&#8217;ve gotta store?</p><p>We can actually put data contracts in between each of those steps. </p><p>If I look at way of working, we can bring out some of the patterns from Agile and Scrum where we can do definition of ready, end definition of done. So the consumer can say, Hey, if I don&#8217;t get given these things, then you don&#8217;t meet my definition of ready.</p><p>Therefore, I&#8217;m not gonna start work because I don&#8217;t have the tools I need to do the job. I need to do. As a producer, I can have a definition of done. Here&#8217;s all the things I would expect. Myself or anybody else in my team to have done before I&#8217;m saying the work&#8217;s done, it&#8217;s ready to be handed off to the next node or the next consumer.</p><p>So we can bring that idea of gates and patterns and patent templates into both the way the data works, the way our data platforms work, and the way our teams work.</p><p><strong>Rob</strong>: Exactly that, and that&#8217;s why I really like this idea and I think that. , It&#8217;s always gonna come up by the idea of data contracts, which is, that acceptance criteria between producer consumers for data sets effectively. But, producer consumers are really important for telling you what needs to go into a data contract. &#8216; cause it&#8217;s far more than just technical as you said, because if the analysts are receiving data from an engineering team, they need and expect a certain level of documentation, formats, whatever it is. And so it is really about, yeah, removing friction and unifying those requirements and expectations from both sides. So that&#8217;s exactly why I really like these. The framework is the Pattern to use.</p><p><strong>Shane</strong>: And I think one of the things about data contracts is everybody looks at it from a technical point of view. They look at it as what is the schema, what is the field type? What is the grain? But they forget the actual word contract, which basically means an agreement between two parties.</p><p>So it is a negotiation between the producer and the consumer about what&#8217;s needed for both sides to do their job well. 
that&#8217;s how we should treat it, we should actually negotiate the contract, not just treat it as a technical task.</p><p><strong>Rob</strong>: I a hundred percent agree, and I think that&#8217;s probably improving, but at least in lots of cases, I&#8217;ve seen , data contracts are taken to be technical andnot have the surrounding pieces, which I think I agree with you. They&#8217;re missing</p><p><strong>Shane</strong>: I&#8217;ve never seen a data contract that has the level of documentation involved in the handover unless it was an API contract. . Which is in theory self-documenting. But we don&#8217;t put, APIs between our bronze, silver, and gold layers. . We should probably we should definitely put contracts in between each of those moving parts.</p><p>We don&#8217;t put data contracts in between the way we capture requirements as a business analyst and hand them off to a data team. If we are running separate teams that do requirements gathering in a separate team, that doing bill, which again, I don&#8217;t actually suggest you do. Put them in the same team, solves a whole lot of problems.</p><p>So one of the other things you have done is you&#8217;ve used that idea of consumers and producers to actually map it to tools and technologies that are being used. So just wanna talk us through how you do that and where the value from , that Pattern is.</p><p><strong>Rob</strong>: As I said, I view these as a way to unify these different lenses, technology people, process data. And so when we talk about, for instance, ingesting data from an external system, we&#8217;re gonna use a technology to do that. So I&#8217;m gonna use the example of within the Azure ecosystem, because that&#8217;s what I mainly work with. But any ETL tool, for instance, Azure Data Factory, you can then, say this edge of this diagram is handover, is going to be performed by Azure Data Factory. And then internally, once you&#8217;ve ingested it and gone through your medallion architecture, that&#8217;s gonna be some data warehouse, Postgres, synap, synapse analytics, whatever data platform you are using through to, once you get to the reporting stage, you&#8217;re gonna have a BI tool, typically, whether that&#8217;s Python, power, bi, Tableau. And so really the technologies in general map to the handovers. There are cases where they can be the nodes as well, the producers or consumers. And then I think what also that lets you do, again, the benefit of that is. It lets you have an informed decision about do you really need this technology? So when you&#8217;re thinking about cost, performance capability, when you&#8217;re talking about does your team have the skills to use this technology or do you have to learn something new again, by bringing it all together, you can make an informed decision about that. So it&#8217;s much less the preference of the architect or listen to a vendor pitch and it lets you get that view of what you need to use.</p><p><strong>Shane</strong>: I think the other thing I liked in the diagram you used as an example is you also bring in the personas. So you are mapping out the flow of the data effectively. Then the technologies, the tools that are being used in each step of that flow, and then the personas, you&#8217;re expecting to use those tools.</p><p>So you are differentiating between the tools the data engineers will use and the tools are a persona of data analysts.</p><p><strong>Rob</strong>: Yeah, and as I said, a lot of the people I work with and my clients are. 
Less mature with their data capability. And so for them, a lot of them start with the Google definition of data engineer or data analyst. So it&#8217;s much more useful then to use these types of diagrams, map the technology, and really distinguish the responsibilities of the personas, so then they can hire the right people, get the right skills. But it also again reduces friction, because if you are brought in as a data engineer and you&#8217;ve got clear roles and responsibilities, you&#8217;re not gonna kick up a fuss because you&#8217;ve been given something that you deem as pure analytics or whatever the case is. And so I think again, that&#8217;s really useful to have because it gives you that unified view of your data estate and lets you make informed decisions about technology, team structure, and how data flows from source to target.</p><p><strong>Shane</strong>: The thing I like about it is, it tells me a story, it&#8217;s a map at a high level and I can understand it, I can embarrass some stuff and I can get a bunch of questions. So the one I&#8217;m looking at, you start off by saying there&#8217;s a source system that goes across to a data lake, that goes across to a data warehouse, that goes across to an analysis BI and then across to a report.</p><p>So I can look at that and go, I think you&#8217;re running a three tier architecture: data lake, data warehouse, and an analysis BI layer. And then you&#8217;ve got source systems of data coming, being produced, and then reporting content on the right hand side where people go and use it. And then you are breaking it down to tools.</p><p>So you&#8217;re saying that effectively Data Factory has been used to extract and load the data into the lake, Synapse Analytics has been used to munge and wrangle the data, Power BI has been used to create the analysis BI layer, and then PowerPoint and Excel have been used for the primary reporting.</p><p>And then you&#8217;ve got the idea that data engineers sit on the left and do half the work and data analysts. And so I can look at that and I can go, good. There&#8217;s no Omni, there&#8217;s no ThoughtSpot. I can get an understanding quickly of the technology and I go, yeah, you&#8217;re a Microsoft stack, I&#8217;m not seeing Snowflake, I&#8217;m not seeing Databricks. I know that effectively you&#8217;re gonna be on Azure, not GCP and not AWS. I can see you&#8217;ve got two sets of personas in there. And that raises a question, because in the way you&#8217;ve laid it out, it looks like the data analysts are reaching back into the data warehouse.</p><p>So I can have a question of, are they read only or are they actually able to write transformation code in the warehouse? Your analysis BI layer is Power BI, so I&#8217;m just gonna assume that it&#8217;s a DAX semantic layer kind of thing, and I&#8217;m gonna ask you questions around where metrics are defined. Are they only in that layer or are they somewhere else?</p><p>Your data lake, I&#8217;m gonna assume it&#8217;s Azure Data Lake Gen 2 or something like that, even though you haven&#8217;t specified it. So I&#8217;m gonna ask you those questions. So again, I can get a really quick map of that environment, of that estate, and I can ask you a whole bunch of questions.</p><p>For things that I just want to know because I need to know, or I&#8217;m just nosy and I wanna know. 
And that&#8217;s the value of these maps.</p><p><strong>Rob</strong>: Yeah, and I think what you&#8217;ve just touched on is really useful in that you can create different views like we discussed about earlier. So you might have this, which is your more executive focused story, and then for your technical users, your power users, whomever. You might have that more finer grain that you&#8217;re talking, where you go into more specifics about technology, about the rules. And again, you could also go more in detail. For instance, in my experience, a lot of frustration and friction comes from the handovers is that mismatch of expectation, definition of ready, definition of done. And so this lets you account for some of those. It lets you see quite clearly who owns which parts.</p><p>Is it within one team? So the analyst bi to the report layers are both owned by the data analysts. So the part of the story there is that they should be able to do that internally. Whereas if you look at the handover from data engineering to data analytics, there, we might need some cross team and you have to work out the processes for that. But again , you&#8217;re informed of that with this diagram so you can start to flesh out how that&#8217;s gonna work. Who&#8217;s responsible for what, how communication happens. If it&#8217;s a Slack conversation, a Jira ticket, whatever you want to use for that kind of problem tracking. The extra layers of information you can get when you go to finer granularity.</p><p><strong>Shane</strong>: And I think it&#8217;s that idea of a boundary because my eye is naturally creating boundaries and saying these things these nodes, these links are in the boundary of a data analyst, and these other ones are in the boundary of an engineer. And now I&#8217;m gonna worry about whether the boundaries look weird to me or whether there&#8217;s an the touching of the boundaries or an overlap of the boundary.</p><p>One of the things you do though, when you diagram this is you&#8217;re effectively bucketing or putting together the consumer and the producer node with no line. And I&#8217;m assuming that your visual way of saying that a person consumes something and then produces something, and that in my head is within the node before they hand off to somebody else.</p><p>Is that just a visual style of the way you do it?</p><p><strong>Rob</strong>: It is. So it was trying to capture the logic that, for instance, the data warehouse, it consumes from the previous node, the data lake, and that same data warehouse produces to hand over to the BI tool. So again, you&#8217;re right, you can look at this and view this in different ways. For me, it&#8217;s really just about having the node label as either producer or a consumer or a producer, consumer pair. And yeah, there&#8217;s different ways of doing this. This was one of the simplest ways that I could fit quite a lot of information into a easy to read diagram. And so that&#8217;s a complete design choice that I&#8217;ve had success with.</p><p><strong>Shane</strong>: And I think it&#8217;s just a choice of language now. I understand that when I look at those, they&#8217;re effectively an ensemble. . 
And I&#8217;m looking for ensembles where these consumer producers, which means these in and outs within that node, or I&#8217;m looking for one where there&#8217;s only a producer or a consumer task effectively.</p><p>Again, it helps me tell that story </p><p><strong>Rob</strong>: And in general, obviously you very rarely want to produce data that&#8217;s not being consumed. So in typically they will come in pairs, and obviously there&#8217;s always the question of where do you end the diagram. So in the case we&#8217;re talking about, the reporting layer says consumer producer, but the producer&#8217;s not connected to a consumer. So again, the bit that&#8217;s missing from there is that there is a decision maker, an executive who takes that report and makes a decision, takes an action. And so that&#8217;s also. Where I say it&#8217;s as much art as science when you make these kind of diagrams is there&#8217;s no rule for granularity or your starting end points most of the time.</p><p>There is a logical start point, but where you decide to kinda end the flow. As I said, it can vary a lot and it really just depends on what you&#8217;re trying to convey with that particular story.</p><p><strong>Shane</strong>: And that&#8217;s the important part is you are telling a story. So I could take , that five node one you&#8217;ve got of source system data, lake, data warehouse analysis, BI and report. And I could extend it to the right and actually say actions that are taken off, those reports and outcomes are delivered by those actions and value that&#8217;s delivered from the outcome.</p><p>If that&#8217;s the story I wanted to tell, </p><p><strong>Rob</strong>: yeah, and then you&#8217;d obviously sell a third persona, the executive or whatever. It&#8217;s Exactly. That&#8217;s how easy it is to iterate these and see the difference between where you are. And a proposed change to your process, framework, technology stack, whatever it is, it directly gives you the implications and new connections you need to manage and prepare for.</p><p><strong>Shane</strong>: And that&#8217;s the other part is it helps with the story. So for example, in this scenario, if I said, oh look, I want to introduce a new BI tool, consumption tool, last mile tool I&#8217;m even gonna have to find one that is compatible with the power bi semantic BI layer. If not, we know that the blast impact of that decision is we now actually have to have two semantic BI layers or replace the power BI one with one that actually serves multiple last model tools.</p><p>So again, I can point to something and say, if we do this, then we are gonna affect that. . And visually point to the map so people understand, oh, holy shit. Actually that&#8217;s one third of our data layered architecture that we were replacing. If we&#8217;ve only been working on it for six months, maybe.</p><p>If we&#8217;ve been working on it for 10 years and we can&#8217;t programmatically migrate it, we&#8217;ve got a lot of work coming. So again, it&#8217;s really valuable to be able to point to parts of the map and say, we&#8217;re talking about that. We&#8217;re talking about the uk, not the us. That&#8217;s the value of the map.</p><p><strong>Rob</strong>: Exactly. That&#8217;s it.</p><p><strong>Shane</strong>: And then the other thing we can do, so we can extend left and right, and we can also change the grain effectively so we can add more rows. So for example, at the moment we&#8217;re saying that all source systems are producers and they&#8217;re the same. 
But if I wanted to, I could create different rows,</p><p>I could say we&#8217;ve got relational versus SQL sources, or we&#8217;ve got streaming versus batch sources. So if I wanted to, I could just add more rows, more nodes and links as a row, to add more complexity or tell a different story to this map. Couldn&#8217;t I?</p><p><strong>Rob</strong>: Exactly. You can add more rows in terms of different sources or even, like you said, making it modular and taking out the Synapse piece and say, what if I replaced it with Databricks? And then you can obviously also account for new data. So in this case, the data warehouse is just producing datasets to give to the analytics and BI tool. But you might decide that you want to also capture system data from your warehouse. So from Synapse, maybe: how long your Spark clusters have been running, how many outages you&#8217;ve had, how much it&#8217;s cost. And so you can also kind of, not add new rows, but add an extra dimension where it&#8217;s not just going left to right, but you have metadata that kind of goes up and down as well. So you can add different directions or paths possible in your network.</p><p><strong>Shane</strong>: So do you tend to design the complex map first and then simplify it to tell different stories, or do you tend to start off simply and then add the complexity as different versions as you go? Which way do you work?</p><p><strong>Rob</strong>: So I do something very weird, where I start with a super high level simple flow. So I really just want to get the conceptual model right, and then I skip the middle bit and jump to the perceived finest grain and do the real in-depth detail: these data quality rules, these data contracts, these technologies. Then I backfill the middle bit, because I find it really easy to question the requirements at that very high level. And then by jumping to the really low level, find the real complexities. So like I said, if there&#8217;s a skills mismatch with the team versus technology, which then I can propagate upwards. I&#8217;ve got a few horror stories where I did go from top to bottom, got to the bottom, and then realized something didn&#8217;t work.</p><p>And you have to backfill the whole thing then. And even though you&#8217;re only changing one node, it inherently sometimes changes the nodes that it&#8217;s paired with, obviously &#8216;cause things work differently and connect differently. So that&#8217;s how I work. I think it can work both ways, starting at the very high level or very nitty gritty, detailed level. I think it&#8217;s a matter of preference and how good your initial requirements are. I find the better the requirements, the easier it is to start at the high level and then go from there.</p><p><strong>Shane</strong>: And then you are handcrafting these, you are drawing them as if they&#8217;re pictures. You&#8217;re not using a tool to help you with this.</p><p><strong>Rob</strong>: A little bit of both. So I do start very conceptual, hand drawing in a tool like draw.io, but I do automate some of this with Python in terms of just treating it like a knowledge graph. And you can definitely create a simple CSV file with your nodes and edges and get it to populate a graph. So I do a hybrid approach to get these ready.</p>
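<p><em>A minimal sketch of the kind of hybrid approach Rob describes, assuming a hypothetical CSV layout with producer, consumer and handover columns; it turns the edge list into Mermaid &#8220;diagramming as code&#8221; text that any Mermaid viewer can render. The file layout and the handover labels are illustrative assumptions, not a format used in the episode.</em></p><pre><code># Sketch: turn a CSV edge list of nodes and links into Mermaid text.
# The column layout (producer, consumer, handover) and the sample values
# are illustrative assumptions.

import csv
import io

SAMPLE = """producer,consumer,handover
Source System,Data Lake,Azure Data Factory
Data Lake,Data Warehouse,Synapse pipeline
Data Warehouse,Analysis/BI,Power BI semantic model
Analysis/BI,Report,Excel export
"""

def slug(name):
    """Mermaid node ids cannot contain spaces or slashes."""
    return name.replace(" ", "_").replace("/", "_")

def to_mermaid(rows):
    """Build a left-to-right Mermaid flowchart from (producer, consumer, handover) rows."""
    lines = ["flowchart LR"]
    for row in rows:
        lines.append(
            f'    {slug(row["producer"])}["{row["producer"]}"] -->|{row["handover"]}| {slug(row["consumer"])}["{row["consumer"]}"]'
        )
    return "\n".join(lines)

if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(SAMPLE)))
    print(to_mermaid(rows))
</code></pre>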
<p><strong>Shane</strong>: It&#8217;s interesting that there&#8217;s a gap in the tooling at the moment. We used to have things like Sparx Enterprise Architect, which was a horrific product to try and use. I used to do enterprise architecture and I&#8217;d go into an organization and be forced to use that tool. If I knew in advance, I wouldn&#8217;t take the gig. It just made you so slow, and such waste. And like you, I just use draw.io. But it&#8217;s a graph problem, we&#8217;re talking about nodes and links, effectively ins and outs and relationships. So it&#8217;s surprising that actually there&#8217;s not a great tool for defining it, but then also creating the simple stories.</p><p><strong>Rob</strong>: As I say, I think the closest I&#8217;ve seen is Mermaid JS, so it&#8217;s diagramming as code. So it works, but you just don&#8217;t get the customization to make it more user friendly. So it&#8217;s great for spinning up dirty diagrams that a technical team would love, but to sell this to an exec, it falls short, exactly as you&#8217;re talking about.</p><p><strong>Shane</strong>: You&#8217;re telling a story, so therefore it helps if the story&#8217;s attractive to look at, not ugly. And yeah, I&#8217;ve done ones where I&#8217;ve used some of the LLMs, so I&#8217;ll put it in as a CSV, get it to gimme the Mermaid syntax and then put it into a Mermaid viewer. But it&#8217;s ugly.</p><p>It tells a story, but not in the most attractive way. I think the other thing is, again, you overcomplicate at your own risk. So if you can keep it left to right, in an English speaking country people will understand it. As soon as you start branching off into if-else statements, where you&#8217;re coming along and you&#8217;ve gotta go up and/or down, again, that just increases the complexity of the diagram and the story you&#8217;re telling.</p><p>So if that&#8217;s the story you need to tell, then do it. I think the other thing is how many lenses or dimensions you bring to it. So for the one that you&#8217;ve got here, there&#8217;s a flow of the data and the data layers. There&#8217;s the technology and then the personas. And I find that three is normally the maximum you get to before, again, you start bringing complexity in.</p><p>So if you add another two in there, you&#8217;ve gotta do that consciously. You&#8217;re consciously saying, I want to make this a more complex technical diagram than a simple map.</p><p><strong>Rob</strong>: Fully agree. And again, it&#8217;s all about knowing your target audience and the story you wanna tell. So again, I think these are not your super technical architecture diagrams, and they&#8217;re not meant to replace them. They&#8217;re meant to complement them by giving a view which is consumable by your executive team, your managers, whomever else, decision makers who don&#8217;t know what the icon for Databricks is, for instance. So when you show them an architecture diagram with networks and tool specific icons, it&#8217;s noise, and this is about almost filtering that into something useful for that audience.</p><p><strong>Shane</strong>: One of the ways I use this nodes and link format and this idea of maps is as a workshop with data and analytics teams when we want to change the way they work. So the way it works is I get the team together. I basically put something on the left and something on the right, either on a wall if it&#8217;s in person, or on a virtual whiteboard if it&#8217;s not.</p><p>So an example would be data sources on the left, information consumers on the right. And I ask &#8216;em just to brain dump. Brain dump all the data sources on the left and brain dump all the people that use whatever you produce on the right. So I start getting a bit of a map. 
<p><strong>Shane</strong>: One of the ways I use this nodes and links format and this idea of maps is as a workshop with data and analytics teams when we want to change the way they work. So the way it works is I get the team together. I basically put something on the left and something on the right, either on a wall if it's in person, or on a virtual whiteboard if it's not. So an example would be data sources on the left, information consumers on the right. And I ask 'em just to brain dump: brain dump all the data sources on the left and brain dump all the people that use whatever you produce on the right. So I start giving a bit of a map.</p><p>And then I say to them, just use some stickies and stick a node for everything that you do to get the data from that left to that right. It's always interesting. So some people do very high level stickies, some people do very detailed. At this stage I don't care, I don't give them any boundaries. And then once they've done that, I get 'em to group those stickies together. So effectively, where you're doing the same task, put it in the same area.</p><p>So your idea of where the ins and outs look like they're the same. And from there we now have a flow of work. I can get them to use the dot-voting pattern to say, where do you think it's broken? Where do you wanna invest and change? And a whole lot of other things. But I find it a really quick way of getting a team to document their processes without endless interviews and documentation.</p><p>And also it's really interesting when a leader sits in the room and they use the words, but that's not how we do it, right? We do this. And the team just laugh at them and go, no, this is always how we do it. Or when you have two teams that actually have completely different processes; there's some things that are shared and there's some things that aren't. Looking at it from that lens, gimme some examples of how you actually create these diagrams. Is it just you? Have you ever done it in a teaming environment?</p><p><strong>Rob</strong>: So there's two ways that I've had good success. One is where I've been brought in as almost a contractor, and so they've given me the requirements and a few hours of someone's time, so I go through the requirements with them and then create this myself. Then the value is really in the playback and discussion session. But like you, I've also had success doing a whiteboard activity where we just go through with maybe one of the data teams, the kind of technical hands-on person, to create the flow. Then we go with their manager independently to create the flow. Then we go with the exec level and create the flow. And then you get, like you said, very different viewpoints of how they think things are working. And that's how you get real change at a process level, because it's not what people thought it was. And so I've had great success with that. And then it tends to end with a session where we all come together, put them all up, and go through and discuss differences, similarities, and often we come up with a new truth, which is then the one that's adopted and implemented. So it's normally some Frankenstein monster of all of them. But it turns out that's the one that's useful, because it brings together the different views people have and brings together the best of each way of working. And so that's really where I think the best lessons learned are and how you actually make change using this as a tool. Yep.</p><p><strong>Shane</strong>: And I can imagine when you're doing that, again, now you've got say three maps, right, to keep it really simple. So the three expectations of how the system works. To help get agreement, you can point to parts of the map. You can point to the executive part of the map where it's got the word AI agent, and you can say, nowhere in the other two maps does that have any idea of LLMs or agentic behavior. So we need investment, we need to add that node because it's just missing.
We're not investing in that right now. So by pointing to things in the maps, you can get agreement to add things and then you can get agreement to take things away, I'm assuming.</p><p><strong>Rob</strong>: Yep, exactly that. And another similar example is where you might just need to end up updating their business glossary. So it might just be, and this is a real case, where the IC, so the technical contributor, versus the executive had different definitions of dataset, and so their flows looked very different. And it's just because dataset was taken for granted to mean dataset, and there we just went into the business glossary, added a new entry, and moving forward communication was easier 'cause they were talking the same language. So you can both modify things and update how they work.</p><p><strong>Shane</strong>: Yeah, actually that's a really good point, is that when you're doing these maps, you need to be iterating a bunch of business glossary definitions, so that if there is a box, if there's a node and it's got a word, that word needs to be defined. Because otherwise everybody's gonna look at the map and go, Christchurch, that's Christchurch, New Zealand. No, it's Christchurch in the UK. Oh, okay. Actually, it needs to say Christchurch, New Zealand and Christchurch, UK. Otherwise we are gonna look at different parts of the map thinking it's the same thing. I think there's two other areas I've seen value during this process. When I get brought in to do data blueprints for organizations, I will create these as a way of articulating a story, like you said. It will help me to understand the thing I'm trying to map, the system I'm trying to define as a blueprint, and then use that to test and iterate and get feedback on whether I'm on the right track for what the organization thinks they want.</p><p>And the second one is when you go in and do a review or a stocktake. I do exactly what you said as well. I will talk to the technical people and get them to help me draw the nodes and links diagram, or get them to do it. And then I'll go read the technical documentation, the solution design, any documentation, and see how well it conforms to their understanding. Because that means one of two things: the documentation is out of date, which is typically the case, or the people have an impression of how the system's working, but that's not actually what's happening.</p><p>So those two use cases for this pattern, I've found, are really valuable as well.</p><p><strong>Rob</strong>: Yeah, I think we've got a very shared experience with those. The last one, which just came to mind, is there's an educational piece, which is really interesting, in that when we talk about these nodes and edges, producers, consumers, as I said, I like to use 'em for different use cases. But for instance, executive level people may not know what a data flow diagram or data lineage is, but if they're happy with this as a conceptual map, source, data warehouse, analysis, and so on, then you can give the example, for instance, that if you instead make the nodes datasets, then if you think about it, it's really just a data flow diagram or a data lineage diagram. And so there's a nice educational piece where something that they're comfortable with can help them understand something more technical that they don't run into.
And I think that's actually never the intent, but always a nice side effect of using these.</p><p><strong>Shane</strong>: I think we need to be careful there though, because all data lineage diagrams look awful. They look incredibly complex. They look like a Frankenstein version of the London Underground. And therefore we need to be really careful that telling that story has some value.</p><p>Because it's a tool for data professionals to be able to go into the detail, to find the bit of nodes and links that they need to solve a problem. I think it's like data models. In the ghost of data past, the enterprise data modeler would print everything out on an A0, have it up on the wall and be very proud of the size of their data model, the number of nodes and links and how complex it was. Nobody else gave a shit. In fact, it actually did them a disservice because it was like, I don't understand that. And I think we've gotta be careful with data lineage as well, that actually it's a tool for us, for data professionals. It's not a great map for information consumers.</p><p>So I'm with you on understanding the concepts, and people can look in and go, oh yeah, it looks like a really complex version of what I understand. I'm probably not gonna go near it 'cause I don't need to.</p><p><strong>Rob</strong>: So yeah, exactly that. When I say to help 'em understand what it is at a conceptual level, it's not an excuse to start showing up with those lineage diagrams. However, it does mean if they know what a lineage diagram is, and the consequences for contracts, observability, things which can affect cost and value, have implications, it's then easier to get time or money assigned for those types of projects, because they have some awareness of it, versus you rocking up out of the blue saying our observability is a mess. So I fully agree. I'm definitely not saying to try and turn the execs into data lineage experts, but again, just to increase awareness. Something I say a lot is, just as companies want to become more data-driven, data engineers should want to become more business-driven. And so it's all about that.</p><p>I'm not expecting data engineers to know the business through and through, but we should be aware of key metrics, processes, streams of revenue, what have you. And so I think it's just about the conceptual awareness more than anything else.</p><p><strong>Shane</strong>: I think the other thing is the complexity of the map also tells me a story. In our product, we run a relatively simple three-tier architecture. We have history, design, and consume. And then within our design layer, there is effectively three objects you can create. You can create a concept object, which is a list of keys for a thing: customer, supplier, employee, product, order, payment.</p><p>You can create detail about it: customer name, product SKU, order quantity, payment dollar amount. And you can create an event, a relationship between them: customer orders product. And that's it. There's only three types of objects you can create. And what that means is when I look at our lineage graph, it may have lots of left to rights, but the number of columns in it is very light, which means when I have to troubleshoot, I have a very short conversation with myself.</p><p>Is it the history layer? Is it one of the concept, detail or event objects in design? Or is it the consume layer?
When I go look at other organizations and they're creating transformation code that has lots of create tables, temporary tables in between, I've now got 16 columns for a relatively simple transformation.</p><p>That complexity comes with cost.</p><p>And we've gotta really understand that. And then the other thing we can do is, if we think about it, if these maps become context, if they become metadata, if they become something we can query, we can actually put a boundary around a map and say, hey, if we replace the source system, how many of those nodes and links, how many of those consumer and producer ensembles, need to be touched?</p><p>Oh, 250 out of how many? Out of a thousand. Okay. So what we're saying now is we actually have to refactor 25% of our entire estate, and we can get a sense of the cost of that change at a really high level. Not in detail, but we can start to really understand how much of an impact on the system we're gonna make when we make these types of changes.</p><p><strong>Rob</strong>: Yeah, I don't have anything to add. That's exactly right.</p><p><strong>Shane</strong>: Which again comes back to actually turning this context, this metadata, into actual data we can use. That's probably something that we should think about a lot more. Because I'm like you, I just draw them. I've thought about automating them to make my life of drawing them easier, but I haven't actually thought about using them as a global repository to help me make better decisions.</p><p><strong>Rob</strong>: Yeah, I think there's something really interesting there, because as I said, I see these very much as a complement to those other artifacts that already exist. Your org charts, your architecture diagrams, data dictionaries, business glossaries. So I think there is a really powerful layer there, that if you could bring all this together as your context layer and then, yeah, use that again just as context.</p><p>I think there's something really powerful you could do with that. I dunno of anyone or anything that's implementing that or even thinking about that yet.</p>
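<p>If the map already lives as nodes and edges, as in the earlier CSV sketch, that boundary question becomes a small graph query; the node name below is purely an assumed example:</p><pre><code>
import networkx as nx

def impact_of_replacing(G: nx.DiGraph, node: str) -> float:
    """Share of the map that sits downstream of the node you plan to replace."""
    affected = nx.descendants(G, node)  # every node reachable from this one
    return len(affected) / max(G.number_of_nodes() - 1, 1)

# Reusing the graph G built from edges.csv earlier: if "CRM" feeds a quarter
# of the nodes, swapping that source out means touching roughly 25% of the estate.
# print(f"{impact_of_replacing(G, 'CRM'):.0%} of the map is affected")
</code></pre>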
<p><strong>Shane</strong>: And it comes back to Sparx EA and all those standards, and the tools that complied with those standards or applied those standards. That's what they were doing. They were bringing all these different dimensions or lenses about everything we know about an organization. The problem was the tools were just horrible to use.</p><p>They weren't friendly. And then the diagrams they produced were ugly. So I think that's the key: you've gotta make it easy to create and you've gotta make it easy to consume, producer and consumer.</p><p>But do it in the way you define your system as much as the way you define the way you work, the way you define your technology, your architecture, and your data flows.</p><p>So yeah, I think that's a good point.</p><p><strong>Rob</strong>: 'Cause that's really what I try to do. I'm a big fan of Dylan Anderson's posts about people, process, technology and data, the four pillars. And I always use these in that context of bringing those four together and how you build a data capability or data strategy, which in turn helps you achieve your business strategy, right?</p><p>So it's all about that kind of hierarchy of do what we can and hope that we've done enough for it to propagate up into something more meaningful.</p><p><strong>Shane</strong>: I agree. So when I do the data blueprints, I focus on team design and ways of working, flows of work, as much as I focus on architecture. 'Cause I got sick of strategies or designs that were just a bunch of technology boxes and none of the other things that were important. The other thing I do add though, is I start with measures of success. Which is, if we're gonna spend all this money changing what we do or implementing this new platform or whatever, what does good look like? How do we measure that the investment was worth it? Is that increasing the amount of self-service being done by people outside the data professionals? Is it data or information being delivered faster or with a higher quality? What does it actually look like if we are successful after we spend all this time and money? 'Cause I'm surprised at the number of people that don't even think about that.</p><p>They just deep dive straight into the architecture map.</p><p><strong>Rob</strong>: Yeah. Whereas I think, again, I'm fully aligned with you. Stuff like the architecture map is almost a means to the end. It's how you achieve the goal. It's not the goal. And so I also start with the people and process driven part. What do we want to achieve? Ways of working to achieve it. How do we know we're successful? And then, as I said, things like your team design or responsibilities, they then inform my architecture choices. A lot of times I'm never gonna suggest to someone who's got two data people on their team to try and pick up an enterprise-level data platform plus storage, plus BI, or whatever the case is.</p><p>It's all about almost the minimal system to achieve the goal and to hit success. And that's again why I think this just helps my thought process. Keep it simple, keep it trackable, and just have impact, versus bells and whistles, which you can always add on later as and when you need them.</p><p><strong>Shane</strong>: And again, going back to that diagram you did, I can also see ways of identifying who we're gonna hire. The one that you drew, it talks about Power BI being the semantic BI layer, PowerPoint and Excel being the primary reporting last mile tools. Then you're gonna hire analysts that are used to using GUIs, draggy-droppy, those types of things.</p><p>Maybe a lot of Excel, which makes sense going back to the DAX formulas. But if you get somebody that's a hardcore Python coder who wants to just use a CLI, there's gonna be a mismatch between the system that's in place and their expectations. Now, you can deal with that by giving them a different set of tools, but now we're gonna have this conversation of how the hell does a CLI with Python code talk to the semantic BI layer?</p><p>Because, sure as shit, they're gonna wanna punch back into the data warehouse layer and use the data, or even back into the lake, because that's what they're used to, the way they're used to working. And that's okay, as long as you understand there's a mismatch and you're gonna have to change something.</p><p>But if you go in there thinking you don't, now you've got a problem. And I can point to you where that problem is. It's a mismatch between the skills of your analysts and the system you've built. So I think that's important. Again, taking these different maps, different dimensions, and being able to compare them.</p><p><strong>Rob</strong>: Yep.
And I think as a consultant, it gives me some validity. I've not just given them a list of tech, I've not just given you an architecture diagram that you probably don't understand. I've given you not just the tools, but the personas, the workflows. I've talked through how I went from your requirements to this proposed solution. And again, it's transparent. It lets you have useful discussions with people. It lets you align their priorities, whether it's cost, people, performance, whatever it is. And that's why I really like this. And as you said, that kind of mismatch, identifying it early, you then get to make an informed decision. I know I've said informed a lot. That's what these diagrams give you. They give you informed decisions from the start, before you are too committed to anything.</p><p><strong>Shane</strong>: It's also a decision that can be challenged, because again, I can point to a box, I can point to a node, I can point to an ensemble, I can point to a line. I can point to a consumer-producer pair and say, that doesn't make sense to me. I can point to a handoff and say, that looks like it's missing something, or that handoff looks like it's waste.</p><p>I can now start to challenge some of those informed decisions with an informed opinion.</p><p>And I think that's important. Again, it becomes less conjecture. Still an opinion, but I can actually try and get some clarity on where I'm disagreeing.</p><p><strong>Rob</strong>: Yeah. It gives you the scaffolding for meaningful conversation and decision making. It's less about opinions, or, there's still opinions, but it's less about opinions without context and more about how opinions fit into a workflow which involves technology, people, data, all the different parts. Yeah, so I fully agree.</p><p><strong>Shane</strong>: One I remember, back in the ghost of data past when we were doing big requirements up front, I used to hate it. Every now and again there'd be one of these mega projects, transformational things, and you used to get a list of requirements, and a number of the bloody things.</p><p>And that was effectively the input into any of the system plans and your blueprints. And I used to map the requirements to the nodes, so this node supports requirements 54B, 27C. It helped mitigate some of the arguments of, where did this come from? But I'm not sure the juice was worth the squeeze. I kind of found it waste.</p><p>What about you? Do you actually map any of these back to requirement statements?</p><p><strong>Rob</strong>: It depends on the level of the requirement statement, so I definitely don't religiously apply it to all of them, but I think the requirements in general give me the context, and that's how I treat it in general. If the requirement is you need to ingest data from system A, that informs some choices. If your requirement is less specific and a bit more hand-wavy or conceptual, then I'm definitely not going to slog to try and assign it to a node. I'm going to use it as broad context, so that when the discussion comes up, I can describe my choices in the context of that. But that's it. So I think, yeah, like you in the early days, 'cause this came from experience. The reason why I'm attached to this idea is I was working a job where none of this was in place and so we had to do this just out of necessity.
And so I think, yeah, I'm like you, I started by very rigidly trying to almost one-to-one map the requirements to the flow, and it doesn't work. You end up with a very rigid workflow and you don't have any kind of freedom to make something better. And so now, yeah, I'm selective over which ones I directly incorporate versus use as context.</p><p><strong>Shane</strong>: I think actually, as I was thinking about it as well, I often use the requirements to identify where I have complexity in my map. For example, if I have a requirement that data's gotta be able to come from the source system, the system of capture or production, and be available in a last mile capability to an information consumer in less than two seconds, I'm gonna have some kind of streaming architecture. And so if I look at your map, yeah, that is a typical batch orientated architecture. Like, I could stream it, but I'd be really surprised if it would be streaming with those layers. So if I then have to do another flow, if I have to do another row on that diagram, which immediately makes it more complex, and that's only so I can stream, then I gotta justify where that came from. And then potentially I could rearchitect the layered architecture to be stream only, if it meets every other requirement. But now I can go back to what's forcing me to have that complexity, and is there any way I can remove the complexity without introducing more complexity?</p><p>Because sometimes having only one row means it's trying to do too many things, and therefore it's even more complex. You're just hiding it. So yeah, I think actually, thinking about it, those broad brush requirements help me again put a boundary around things and say, I have to do this for these reasons.</p><p>If those reasons aren't valid or important, then I can stop doing those things.</p><p><strong>Rob</strong>: Yep. I also think some requirements are very specific. And like we talked about earlier, they might only be applied to a finer grain version of this flow. So when I worked in defense, you can imagine there was quite a lot of strict rules about data encryption and personal information masking. So there, I wouldn't worry about showing that in, for instance, the high level workflow, but in the more data engineering focused one, that's where I would include it. So again, there's a trade-off there in the decision. It's almost twice the work, but it can be twice as impactful to have those two views of this one system at different grains: one for the kind of exec level, one for the technical level.</p><p><strong>Shane</strong>: But again, that helps the collaboration conversation. So if you tell me that we need to mask people's personal identifiable information in that scenario, their names, their dates of birth, maybe some of their deployment information, I'm gonna ask you, where are we masking it?</p><p>Am I masking it in the lake, the warehouse, and in the BI semantic layer?</p><p>Or are we saying actually the lake can hold the raw data? When we get to the data warehouse layer, then we're gonna mask it, which means we now need to control access to that data lake layer, so that only certain people with certain authority can see the data in there. So now I have a boundary: nobody's allowed into that layer unless they pass a certain security level.
So again, they're just helping us make trade-off decisions, and also have a conversation about what happens where and what doesn't. And what does the contract look like?</p><p><strong>Rob</strong>: Yeah, exactly. And then again, it gives you that extra view, like you talked about, kinda access control. If you're in the cloud, there is a different view of this diagram where you might talk about networking or access management. And again, the exec level probably don't need to see that; even if it's a requirement, you might just have that as a bullet underneath with a check mark. But then in the more technical view, you really say, we've made this group which has these permissions, and that's how we've satisfied the requirement. Again, it's all about, yeah, controlling the information you present and at which grain you capture those requirements.</p><p><strong>Shane</strong>: So this idea of nodes and links has been around for ages. This idea of systems thinking, it came outta lean manufacturing. It's been around for a long time. The idea of business process mapping and understanding the flow of work has been around for ages. The idea of enterprise architectures and diagrams that hold the ability to tell different stories at different levels has been around for ages.</p><p>Why do you think in the data domain it's very rarely used?</p><p><strong>Rob</strong>: It's a great question. I dunno the answer. I think in my experience, data people, like data engineers, have always, for whatever reason, decided not to learn from, say, software engineering. And so data folks tend to be, can be, very technical. They're almost very cultish, in, like, they do data. And I think there's very few people, relative to the number of data folks, who actually understand the business and how processes work and how to communicate that. So I think one of the biggest things I learned from academia that I've brought to my career is communication, is presenting to multidisciplinary audiences. And I think that's something that's just skipped if you go from an undergraduate degree to an internship; that piece of learning is missed a lot of the time. And so I'm not saying that's the only or main reason, but it's a practical reason I think I've seen in my career that prevents this kind of thing from picking up traction.</p><p><strong>Shane</strong>: I think we segment the work into hyper specialization and then we start at the lowest level. So we introduce the idea of a data engineer that's gonna write code, or we bring in, that you're gonna be an analytics engineer that's gonna write a model in dbt, and we don't teach end-to-end systems thinking as a framework, as a set of patterns that you've gotta understand first before you can go and do the work.</p><p>If you walk into a factory, and again, I worked in a factory as a kid 30 odd years ago, but you walk the line, you'd understand the flow when you get onboarded, the flow of work from the beginning of the factory to the end, so that you understood where your station was, what your part in it was.</p><p>And I don't think we do that in the data world. I think the other thing is, and one of the reasons I wanted to get you on, is that article you wrote, it was simple to understand, for me at least. The thinking is really aligned to the way I think, and that probably helped.
But what I often find, especially if I look at academic writing, is it's really research based. It's quite technical. It's lots of complex ideas that aren't distilled into a story I can understand. I read them and I'm like, ah, I don't get it. And so to take that complexity and write it with simplicity is actually really hard. Again, big ups to you for writing an article that distilled what is quite a simple idea, but can be complex, down into something that's easy to understand in the written word.</p><p><strong>Rob</strong>: Yeah, I think part of that comes from the fact that, even though I've jumped fields a lot in my career, the one constant has been an interest in mentoring and developing others. And so to do that, you really need to be able to make information accessible, whether that's just choosing the right type of diagram, using the right vocabulary, whatever it is. So a lot of these ideas that I am drawn to, yeah, it's often something which I've not seen someone else explain quite simply. So I take up the challenge to do it, because I think it has value, even if only, yeah, two or three people read that and say, I now get it. I'm happy with that.</p><p>That's worth it. But I think that kind of thinking, as you said, the systems thinking, the context, being able to understand how these systems come together, the modular parts that they're made up of, and then how to communicate that, not just between technical teams, so not just between data engineers, or data engineers and data analysts, but also data engineers with managers, with the C-suite. That is, I think, one of the biggest gaps. And especially now, with lots of buzz that you see about, we don't need junior engineers. There's not just a technical deficit there just because you can outsource some work to LLMs. There is a real issue of new starters not picking up, in quotes, the soft skills, communication skills, and learning how from the start to talk about these ideas, to communicate these ideas. And I think that's something which is a concern for me, and something which podcasts like this, I think, do a really good job of starting to address, giving people another avenue to learn something like this quite easily.</p><p><strong>Shane</strong>: Yeah, I think there's the whole argument right now about the role of the junior or the role of the senior when we all get 10x'd. It's gotta be interesting, because I think education has to change. Because the skills that we learn, the technical skills of how to code, are gonna be supported by the tools of the future, which actually just leaves us the system problem, actually understanding how to daisy chain those tools and that code together to become far more valuable.</p><p>But just going back to that writing process. I find it relatively easy to write complex words; I can just brain dump and I can just write lots of complex stuff. But to then refine it down into something that is clear and simple and reduces the complexity and increases the cognition when you're reading it, I find that the hard effort. That's where I've really gotta focus and iterate time and time again and spend my time. Is that what you find, or is it, like the way you do your diagrams, slightly different in terms of the process you use?</p><p><strong>Rob</strong>: No, I think it's quite similar.
Again, from my academic work, I would write quite technical, scientific papers, but then I would convert those into PowerPoint presentations for conference presentations and stuff. And so I treat writing stuff like this the same way. So I go from a technical idea that I understand, and then I always start by creating the diagrams that capture what I'm thinking, and then the words I can just naturally fit around that. 'Cause once I have the core concept in images, I can just walk through the process of, what do you need to know to understand this image? Then, what do I want you to take away from this image? And so that's always my writing process, taking something complex and making it more accessible.</p><p><strong>Shane</strong>: I think again, that idea of writing some complex words and then drawing yourself a simple map and seeing what's missing or what needs to be added, that helps that visual-to-written back and forward.</p><p><strong>Rob</strong>: That's why I always have a notepad on my desk. 'Cause I always find, even if I'm reading or learning something, if I can draw it out, I can understand it, because I'm a very visual learner, like you said. I think so, yeah. That's how I go through this.</p><p><strong>Shane</strong>: So if people wanna follow you and read what you're writing and get some more of these cool ideas, these cool patterns, in a way that's simple to understand, how do they find you?</p><p><strong>Rob</strong>: So the two best places to find me are on LinkedIn, so Robert S. Long, I think, is my handle, and on Substack at Long Last Analytics. So that's the name of my consultancy. My surname's Long, and I like cheesy things, so I like the "at long last, I've solved your problem" aspect. So that's where you can find me.</p><p><strong>Shane</strong>: I've been reading your stuff for a little while, and before you mentioned that, I only just got the joke as we're doing this podcast. I was like, ah, actually, hold on, that's your last name. So well done, that. I like cheesy as well, but normally I pick it up a lot quicker than that.</p><p>Excellent. All right, anybody who wants to read what Rob's writing, go to Long Last Analytics on Substack, hook him up on LinkedIn. Otherwise, I hope everybody has a simply magical day.
</p><h2>&#171;oo&#187;</h2><div class="pullquote"><p><em>Stakeholder - &#8220;Thats not what I wanted!&#8221; <br>Data Team - &#8220;But thats what you asked for!&#8221;</em></p></div><p>Struggling to gather data requirements and constantly hearing the conversation above?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Bu2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" width="387" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:387,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19725,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/160520537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Want to learn how to capture data and information requirements in a repeatable way so stakeholders love them and data teams can build from them, by using the Information Product Canvas.</p><p>Have I got the book for you!</p><p>Start your journey to a new Agile Data Way of Working.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://adiwow.com/168&quot;,&quot;text&quot;:&quot;Buy the Agile Data Guide now!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://adiwow.com/168"><span>Buy the Agile Data Guide now!</span></a></p><h2>&#171;oo&#187;</h2>]]></content:encoded></item><item><title><![CDATA[Patterns to define the ROI of a data product with Nick Zervoudis]]></title><description><![CDATA[AgileData Podcast #74]]></description><link>https://agiledata.info/p/patterns-to-define-the-roi-of-a-data</link><guid isPermaLink="false">https://agiledata.info/p/patterns-to-define-the-roi-of-a-data</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Thu, 04 Sep 2025 06:24:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/J9aDUJu9d5s" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Nick Zervoudis about patterns that you can use to quickly and easily define the ROI of your data products, before you build them.</p><blockquote><p><strong><a href="https://agiledata.substack.com/i/172748370/listen">Listen</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/172748370/google-notebooklm-mindmap">View MindMap</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/172748370/google-notebooklm-briefing">Read AI Summary</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/172748370/transcript">Read 
Transcript</a></strong></p></blockquote><p></p><h2>Listen</h2><p>Listen on all good podcast hosts or over at:</p><p><a href="https://podcast.agiledata.io/e/patterns-to-define-the-roi-of-a-data-product-with-nick-zervoudis-episode-74/">https://podcast.agiledata.io/e/patterns-to-define-the-roi-of-a-data-product-with-nick-zervoudis-episode-74/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://podcast.agiledata.io/e/dimensional-data-modeling-patterns-with-johnny-winter-episode-73/&quot;,&quot;text&quot;:&quot;Listen to the Agile Data Podcast Episode&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://podcast.agiledata.io/e/dimensional-data-modeling-patterns-with-johnny-winter-episode-73/"><span>Listen to the Agile Data Podcast Episode</span></a></p><blockquote><p><strong>Subscribe:</strong> <a href="https://podcasts.apple.com/nz/podcast/agiledata/id1456820781">Apple Podcast</a> | <a href="https://open.spotify.com/show/4wiQWj055HchKMxmYSKRIj">Spotify</a> | <a href="https://www.google.com/podcasts?feed=aHR0cHM6Ly9wb2RjYXN0LmFnaWxlZGF0YS5pby9mZWVkLnhtbA%3D%3D">Google Podcast </a>| <a href="https://music.amazon.com/podcasts/add0fc3f-ee5c-4227-bd28-35144d1bd9a6">Amazon Audible</a> | <a href="https://tunein.com/podcasts/Technology-Podcasts/AgileBI-p1214546/">TuneIn</a> | <a href="https://iheart.com/podcast/96630976">iHeartRadio</a> | <a href="https://player.fm/series/3347067">PlayerFM</a> | <a href="https://www.listennotes.com/podcasts/agiledata-agiledata-8ADKjli_fGx/">Listen Notes</a> | <a href="https://www.podchaser.com/podcasts/agiledata-822089">Podchaser</a> | <a href="https://www.deezer.com/en/show/5294327">Deezer</a> | <a href="https://podcastaddict.com/podcast/agiledata/4554760">Podcast Addict</a> |</p></blockquote><div id="youtube2-J9aDUJu9d5s" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;J9aDUJu9d5s&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/J9aDUJu9d5s?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>You can get in touch with Nick via <a href="https://www.linkedin.com/in/nzervoudis">LinkedIn</a> or over at:</p><div class="embedded-publication-wrap" data-attrs="{&quot;id&quot;:1085365,&quot;name&quot;:&quot;Value from Data &amp; AI&quot;,&quot;logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!DDQi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2088c5cd-8bfa-4950-8665-021c768e9e53_500x500.png&quot;,&quot;base_url&quot;:&quot;https://blog.valuefromdata.ai&quot;,&quot;hero_text&quot;:&quot;A newsletter about data &amp; AI product management&quot;,&quot;author_name&quot;:&quot;Nick Zervoudis&quot;,&quot;show_subscribe&quot;:true,&quot;logo_bg_color&quot;:&quot;#ffffff&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPublicationToDOMWithSubscribe"><div class="embedded-publication show-subscribe"><a class="embedded-publication-link-part" native="true" href="https://blog.valuefromdata.ai?utm_source=substack&amp;utm_campaign=publication_embed&amp;utm_medium=web"><img class="embedded-publication-logo" 
src="https://substackcdn.com/image/fetch/$s_!DDQi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2088c5cd-8bfa-4950-8665-021c768e9e53_500x500.png" width="56" height="56" style="background-color: rgb(255, 255, 255);"><span class="embedded-publication-name">Value from Data &amp; AI</span><div class="embedded-publication-hero-text">A newsletter about data &amp; AI product management</div><div class="embedded-publication-author-name">By Nick Zervoudis</div></a><form class="embedded-publication-subscribe" method="GET" action="https://blog.valuefromdata.ai/subscribe?"><input type="hidden" name="source" value="publication-embed"><input type="hidden" name="autoSubmit" value="true"><input type="email" class="email-input" name="email" placeholder="Type your email..."><input type="submit" class="button primary" value="Subscribe"></form></div></div><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><h2>Google NotebookLM Mindmap </h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BD-9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec1053a1-1cec-4670-8d7d-2d0addf0e0b6_6961x22336.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BD-9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec1053a1-1cec-4670-8d7d-2d0addf0e0b6_6961x22336.png 424w, https://substackcdn.com/image/fetch/$s_!BD-9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec1053a1-1cec-4670-8d7d-2d0addf0e0b6_6961x22336.png 848w, https://substackcdn.com/image/fetch/$s_!BD-9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec1053a1-1cec-4670-8d7d-2d0addf0e0b6_6961x22336.png 1272w, https://substackcdn.com/image/fetch/$s_!BD-9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec1053a1-1cec-4670-8d7d-2d0addf0e0b6_6961x22336.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BD-9!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec1053a1-1cec-4670-8d7d-2d0addf0e0b6_6961x22336.png" width="1200" height="3850.5494505494507" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec1053a1-1cec-4670-8d7d-2d0addf0e0b6_6961x22336.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:4672,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:8105045,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/172748370?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec1053a1-1cec-4670-8d7d-2d0addf0e0b6_6961x22336.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BD-9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec1053a1-1cec-4670-8d7d-2d0addf0e0b6_6961x22336.png 424w, https://substackcdn.com/image/fetch/$s_!BD-9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec1053a1-1cec-4670-8d7d-2d0addf0e0b6_6961x22336.png 848w, https://substackcdn.com/image/fetch/$s_!BD-9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec1053a1-1cec-4670-8d7d-2d0addf0e0b6_6961x22336.png 1272w, https://substackcdn.com/image/fetch/$s_!BD-9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec1053a1-1cec-4670-8d7d-2d0addf0e0b6_6961x22336.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>Google NoteBookLM Briefing</h2><h3>Briefing Document: Quantifying ROI for Data Products</h3><p><strong>Source:</strong> Excerpts from "AgileData 74 - Patterns to define the ROI of a data product with Nick Zervoudis" <strong>Speakers:</strong> Shane Gibson (Host), Nick Zervoudis (Guest, Independent Consultant and Trainer, Founder of Value From Data and 
AI)</p><h3><strong>1. Introduction: The Challenge of Quantifying Data ROI</strong></h3><p>The podcast episode highlights a common and significant problem in the data domain: the difficulty in quantifying the Return on Investment (ROI) for data projects and products. Organisations often struggle to move beyond identifying potential actions and outcomes from data to actually assigning a monetary value to those outcomes.</p><ul><li><p><strong>Core Problem:</strong> Stakeholders can describe the desired action and outcome (e.g., "reduce customer churn," "increase sales"), but "then you get crickets" when asked to quantify the financial impact.</p></li><li><p><strong>Nick Zervoudis' Background:</strong> Nick has a career in data, bridging the gap between technical and non-technical people, initially in consulting (Capta Invent) and then in data product management (PepsiCo, CK Delta). He's now an independent consultant, emphasising "value from data and AI." His experience spans internal and external data products, including data platforms, analytics products (dashboards, CSVs), and machine learning applications.</p></li></ul><h3><strong>2. The Shift to Data Product Thinking and Value-Centricity</strong></h3><p>The speakers note a growing, but still evolving, trend towards applying product management principles to data. This "data as a product" approach is seen as crucial for addressing the ROI challenge.</p><ul><li><p><strong>Product Thinking for Data:</strong> "It's interesting that there's this move in the last couple years to bring product management thinking or data as a product that way of working from the product domain into the data domain. And I think it's been great. I think we've seen some real changes..."</p></li><li><p><strong>Value and Customer Centricity:</strong> While some companies have embraced this for 25 years, many are "laggards" slowly adopting a "value-centric and customer-centric" way of working with data.</p></li><li><p><strong>Moving Beyond "Feature Factories":</strong> Data teams often act as "feature factories" or "data request" fulfillers, building what stakeholders demand without understanding the underlying problem or value. This leads to unused dashboards and wasted effort.</p></li></ul><h3><strong>3. Key Strategy: Fermi Estimation for ROI (Back-of-the-Envelope Calculations)</strong></h3><p>A central theme is the importance of using quick, rough estimations &#8211; "Fermi estimations" &#8211; rather than striving for perfect precision at the outset.</p><ul><li><p><strong>Fermi Estimation:</strong> Named after Enrico Fermi, who made quick, order-of-magnitude estimations (e.g., the TNT equivalent of a nuclear blast). The goal is to get the "order of magnitude right," not exact numbers.</p></li><li><p><strong>Simplicity is Key:</strong> Data professionals often overcomplicate ROI calculations, thinking they need "exact numbers" and "the same rigor as a lot of the other data work."
Instead, "a lot of the time all you need is a back of the envelope calculation."</p></li><li><p><strong>Example: Churn Reduction:</strong> If stakeholders want to reduce churn, even rough estimates of customer lifetime value, churn rate, and potential reduction (e.g., 10%) can quickly reveal if the opportunity is worth $500, $5,000, or $500,000 (see the sketch after this list).</p></li><li><p><strong>Prioritisation Tool:</strong> These rough estimates allow for quick comparison of many opportunities (e.g., 15-150 ideas) to identify the most valuable ones, or those with the highest value per unit of effort.</p></li><li><p><strong>"S*** First Draft" Approach:</strong> Instead of asking stakeholders for a number on a blank sheet, provide them with a "s***** first draft" of your calculation. This makes it "so much easier for both technical and nontechnical stakeholders to basically critique something or provide me with an input I'm looking for if I give them the scaffolding."</p></li></ul>
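<p>A back-of-the-envelope sketch of the churn example above, where every figure is an assumed round number rather than anything quoted in the episode:</p><pre><code>
# Every figure below is an assumed round number, purely for illustration.
customers          = 20_000
annual_churn_rate  = 0.15   # 15% of customers leave each year
lifetime_value     = 1_200  # rough value of a retained customer, in dollars
expected_reduction = 0.10   # the data product might cut churn by ~10%

churned_per_year = customers * annual_churn_rate
customers_saved  = churned_per_year * expected_reduction
rough_value      = customers_saved * lifetime_value
print(f"Order of magnitude: ~${rough_value:,.0f} per year")  # ~$360,000
</code></pre>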
<h3><strong>6. Metric Trees: Visualising Business Relationships and Sensitivity</strong></h3><p>Metric trees are presented as a valuable tool for understanding the interconnectedness of business inputs and outputs, enabling more informed decision-making.</p><ul><li><p><strong>Understanding Relationships:</strong> Metric trees help to "understand the relationship between the different inputs in my business and how these translate into outputs."</p></li><li><p><strong>Business Sensitivity:</strong> They reveal "what is my business's sensitivity for those different things." For example, how a 10% increase in mailing list subscribers cascades through click-through rates, conversion rates, and profitability (a toy sketch of this cascade appears below, just before section 8).</p></li><li><p><strong>Simplicity for Stakeholders:</strong> While the underlying calculations might be complex, the visual representation and the "output that a business stakeholder sees has to be super simple so that they can also understand this whole concept of making data-driven decisions."</p></li><li><p><strong>Avoiding Over-Engineering:</strong> Data professionals' tendency to seek extreme accuracy (e.g., "spend 3 months grabbing it, modeling it, getting the actual abandonment rate") can delay value. Metric trees support the "light touches" of the discovery/ideation phase.</p></li></ul><h3><strong>7. Measuring Success and the Measurement &amp; Evaluation Workstream</strong></h3><p>Proving ROI requires a deliberate plan for measuring the impact of data products, ideally integrated from the project's start.</p><ul><li><p><strong>Pre-emptive Measurement:</strong> "It's so much easier to actually figure out the ROI of something if we've done this exercise that goes, what is the business outcome we're going to be influencing here?"</p></li><li><p><strong>Dedicated Workstream:</strong> For significant projects, Nick recommends insisting that "there needs to be a measurement and evaluation workstream as part of the project."</p></li><li><p><strong>Defining Success Metrics:</strong> This workstream defines "what are the metrics for success." If the necessary data isn't readily available (e.g., in a metrics tree), "we need to set up some kind of measurement for this new thing."</p></li><li><p><strong>Beyond Usage Metrics:</strong> Simply measuring dashboard usage (e.g., "opened and run") is often insufficient. Qualitative feedback (interviews, surveys) is "so much richer" than viewing time or open rates.</p></li><li><p><strong>Linking to Action:</strong> True value comes from enabling "better decisions" and influencing specific actions. Dashboards should be integrated into workflows (e.g., "every Monday morning I open this dashboard... and I make one, two, three actions off the back of it").</p></li><li><p><strong>Deliberate Data Collection:</strong> The "big data data lake approach" often fails because crucial data points are missing. Being deliberate about the business problem helps identify necessary data points, and if they don't exist, "we need to create those data points. It's not a nice to have, it's a must-have condition."</p></li></ul>
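<p>To make the cascade idea concrete, here is a toy metric-tree sketch in Python. The funnel structure and every number in it are invented for illustration (they are not taken from the episode); the point is simply that nudging each input by 10% shows which lever the business is most sensitive to.</p><pre><code># Toy metric tree for a simple e-commerce funnel: compare how sensitive
# monthly profit is to a 10% improvement in each input.
# All numbers and the funnel shape are illustrative assumptions.

def monthly_profit(m):
    email_visits = m["list_size"] * m["emails_per_month"] * m["click_rate"]
    visits = email_visits + m["other_visits"]   # most traffic is not from email
    sales = visits * m["conversion_rate"]
    return sales * m["profit_per_sale"]

baseline = {
    "list_size": 50_000,      # mailing list subscribers
    "emails_per_month": 4,
    "click_rate": 0.02,       # email click-through rate
    "other_visits": 40_000,   # non-email site traffic per month
    "conversion_rate": 0.03,  # checkout conversion
    "profit_per_sale": 25,    # average margin per order
}

base = monthly_profit(baseline)
print(f"Baseline profit: ${base:,.0f}/month")

# Nudge one input at a time by 10% and see how far the output moves.
for metric in ("list_size", "click_rate", "conversion_rate", "profit_per_sale"):
    scenario = {**baseline, metric: baseline[metric] * 1.10}
    uplift = monthly_profit(scenario) - base
    print(f"+10% on {metric}: +${uplift:,.0f}/month")
</code></pre><p>In this made-up funnel a 10% lift in conversion rate or margin moves profit far more than a 10% bigger mailing list, which is exactly the kind of comparison the speakers suggest a metric tree should make visible before anyone commits to building anything.</p>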
<h3><strong>8. Balancing Foundational Work with Value Delivery</strong></h3><p>The discussion touches on the age-old tension between building robust data foundations and delivering immediate business value.</p><ul><li><p><strong>Avoid "Platform First" Pitfalls:</strong> "We spend two years doing a big migration promising the business that after the big migration we'll finally be able to deliver value, and what do you know, two years later the CDO gets fired, a new CDO comes in, and the first thing they do is they want to rebuild the platform..." This is a common and detrimental cycle.</p></li><li><p><strong>Bundle Value with Foundations:</strong> It's crucial to "bundle any kind of let's call it technical debt or platform investment or foundational investment together with something that's going to deliver value to the business and deliver in minimum increments of value."</p></li><li><p><strong>Intentional Technical Debt:</strong> Technical debt isn't inherently bad; it's "borrowed from our future selves... in order to usually test something." There's "no point building something super robust and scalable if we don't know it's worth scaling in the first place."</p></li></ul><h3><strong>9. Quantifying the "I" (Investment) and Prioritisation</strong></h3><p>Understanding the cost side of ROI is equally important, particularly for internal prioritisation.</p><ul><li><p><strong>Beyond Value:</strong> ROI requires both value and investment. The "I part is basically the cost." This includes incremental costs (additional hours, contractors, compute) rather than sunk fixed costs.</p></li><li><p><strong>Internal Accountability:</strong> Data teams should know their operating costs and aim to "be delivering more than that, like a multiple" (e.g., "If we're costing 100K a week, then any given week, we should be delivering at least 110 if not 200K back to the business").</p></li><li><p><strong>Using Financial Language for Prioritisation:</strong> When prioritising, use "financial numbers" to justify decisions to stakeholders. For example, "your thing is going to cost the business 200K, but based on our projections, it's only going to make us an extra 100K."</p></li><li><p><strong>Data Product Manager's Role:</strong> While committees often prioritise based on "who has got the biggest voice," a data product manager should ideally make the final decision based on value, involving stakeholders in the process.</p></li></ul><h3><strong>10. Data as a Value Driver, Not Just a Cost Centre</strong></h3><p>The speakers challenge the notion of data (and even other shared services like HR/IT) solely as cost centres.</p><ul><li><p><strong>Opportunity Cost:</strong> Treating departments as cost centres can make "a lot of things become invisible to the business," particularly opportunity costs (e.g., the cost of using inefficient old software).</p></li><li><p><strong>Innovating and Unlocking Value:</strong> Data is a "more nascent profession" that helps "innovate," "improve the quality of decisions," "unlock new streams of revenue," and "build new products," especially with the rise of AI.</p></li><li><p><strong>Avoiding Commoditisation:</strong> Nick doesn't want the data team to act like a cost centre because "then we're just going to default to doing bare minimum low value adding tasks that can be commoditized."</p></li></ul><h3><strong>11. Qualities of a Good Data Product Manager</strong></h3><p>The episode concludes by identifying key aptitudes for successful data product managers.</p><ul><li><p><strong>Ownership:</strong> The most critical quality.
Being "invested in the outcomes you're trying to enable," not just completing tasks. Good PMs "fill in the gaps" across technical, marketing, or financial analysis areas.</p></li><li><p><strong>Curiosity:</strong> A "sense of curiosity to learn more about your users, about your business, about the technical underpinnings of your product, about what the data actually shows and means." This prevents becoming a mere "information sifter" and enables proactive, strategic impact.</p></li><li><p><strong>Problem-First Mindset:</strong> "Be inquisitive around what the problem is understand the problem itself before you worry about the solution." This aligns with "product thinking" and "jobs to be done" frameworks.</p></li></ul><h3><strong>Key Takeaways for Action:</strong></h3><ol><li><p><strong>Embrace Fermi Estimations:</strong> Don't strive for perfect accuracy upfront. Use quick, "back-of-the-envelope" calculations to get an order of magnitude for ROI, especially in the discovery and ideation phases.</p></li><li><p><strong>Collaborate Extensively:</strong> Involve stakeholders (including finance) from the start. Share "s***** first drafts" of calculations and co-create understanding of processes and value.</p></li><li><p><strong>Focus on Business Outcomes:</strong> Always ask "why" and link data requests to specific actions, measurable outcomes, and the "trifecta" of cost saving, revenue improvement, or risk reduction.</p></li><li><p><strong>Implement Measurement &amp; Evaluation:</strong> For any significant data product, build a measurement and evaluation workstream into the project plan, defining success metrics and how they will be tracked.</p></li><li><p><strong>Balance Foundations with Value:</strong> Bundle foundational data work with initiatives that deliver tangible, incremental business value, avoiding lengthy "platform-first" projects.</p></li><li><p><strong>Quantify Investment:</strong> Understand and communicate the cost of data initiatives alongside their potential value to inform prioritisation decisions.</p></li><li><p><strong>Cultivate Ownership &amp; Curiosity:</strong> For data professionals, especially those in product roles, these aptitudes are crucial for understanding complex problems and driving impactful solutions.</p></li></ol><p></p><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><p></p><h2>Transcript</h2><p><strong>Shane</strong>: Welcome to the Agile Data Podcast. I'm Shane Gibson.</p><p><strong>Nick</strong>: And I'm Nick Zervoudis.</p><p><strong>Shane</strong>: Hey, Nick. Thanks for coming on the show. Today we are going to talk about the thing that I never actually see, which is quantifying return on investments from data teams. Before we rip into that though, why don't you give the audience a bit of background about yourself self.</p><p><strong>Nick</strong>: Yeah. And it's great to be here. So I'm Nick. 
I've worked my whole career in data, but always in that kind of squishy role that sits somewhere between the purely technical people and the people that haven't looked at an equation since about 25 years ago. At first, that was in consulting. First for a boutique consultancy specialized in data science.</p><p>Then for Capgemini Invent. Then I made the leap to product management. I worked for PepsiCo and then for CK Delta, a subsidiary of a large conglomerate owning a lot of infrastructure businesses around the world, and now I'm on my own. I made the leap to be an independent consultant and trainer, specialized around data product management, because I see so many organizations still make the same really basic mistakes around how should we get value out of our data investments, which is also why I've named my company Value from Data and AI.</p><p>That's the short story about me. I'm someone that studied politics, but somehow has been in data his whole career. And even though I never considered myself technical, I've now gotten to the point where I'm often introduced, at least by the business stakeholders, as, oh, Nick's the technical guy.</p><p>He understands the data side. Except when I then talk to the data scientists, then, I don't think they think I'm a clueless person that doesn't understand anything they do, but that's how it feels compared to them. So here we are.</p><p><strong>Shane</strong>: Yeah, I'm a big fan of Hitchhiker's Guide to the Galaxy, so I talk about people like that as Babelfish, depending on which part of the organization you're talking to, you're interpreting for the other part that's not in the room normally. When you were in product management for those companies, was it physical products or digital products?</p><p>What part of the product management were you in?</p><p><strong>Nick</strong>: Always data products. And here when I say data product I'm mishmashing a few different types. Like you've got data platforms that are enabling other use cases. They're not necessarily delivering value end-to-end themselves. Data and analytics products, things like a dashboard or shipping out CSVs to end customers.</p><p>All the way to using machine learning and LLMs. And that's been the types of products I've managed my whole career, including when I was in consulting and was an undercover data PM, and we didn't call it data PM 'cause that title didn't exist and we didn't even know that product management was a thing.</p><p>And then the other thing that's worth calling out is that I've also dabbled in both internal facing products, where your customers are colleagues of yours in different departments, and external data products, where we're literally selling data sets or selling dashboard access to customers that are interested in buying that data.</p><p><strong>Shane</strong>: I think it's interesting that there's this move in the last couple of years to bring product management thinking, or data as a product, that way of working from the product domain into the data domain. And I think it's been great. I think we've seen some real changes to the way we work with data and the processes we use and everything we do. Still quite lacking. There's still a lot to learn and a lot to bring to the table. And one of the things that I wanted to talk about is this idea of return on investment, because
when I'm working with organizations, it's hard enough to get them to the stage where they'll determine the potential action that will be taken from the data or information that's gonna be delivered, and what the outcome to the organization might be from that action. But it's very rare, if ever, that I see them make the next jump to actually quantify the return on investment for that outcome. They may say, I need some data. I wanna understand all the customers that haven't bought something in the last six months. And you say, great. If I answer that question with data and information, what are you gonna do? Ah, we know that they're likely to leave, so we're gonna go and do a churn offer, send out some emails, give them a discount. Okay, if that churn campaign is successful, what's the outcome? Oh, we'll retain some customers, we'll increase our margin, get some more sales, and you go, great. If you had to quantify that as a number, what would it be? And then you get crickets. Maybe it's because they don't actually have the data to input to help them make that decision and understand what that would be. But it just seems to stop there, so how have you dealt with that, this idea of return on investment? How do you actually approach it?</p><p><strong>Nick</strong>: Yeah, it's a great question and I wanna answer it by going on a bit of a tangent first, because I wanna comment on even what you said that, oh, in the last couple years, companies are waking up to the need for product thinking. And it's true, like a lot more companies are doing it. And then there's other companies or other data teams that have been doing it for the last 25 years.</p><p>So I feel like there's a very skewed distribution in terms of which companies are doing things in that value centric and customer centric way versus all the laggards, in the same way that if you look at how many people are using maybe LLMs in production today, it's a very small number of companies around the world.</p><p>If you looked at it two years ago, it was an even smaller number. And we're in that stage where the laggards and even the people in the middle of the adoption curve are slowly getting into it. And I think that's relevant because for me, a lot of these, let's say, best practices and now how do we figure out the ROI, it's not something that's super complicated or cutting edge or, oh, you need to have all your ducks in a row, all your data in a beautiful metrics tree before you can start asking those questions.</p><p>Because let's jump into your example, right? I thought you'd give me a harder case to start with, right? From what you've said, it sounds pretty straightforward.</p><p><strong>Shane</strong>: Let's start with the simple ones.</p><p>If we see how you can apply the patterns to a simple scenario, then it gives you the sense of the pattern. And then from there we can go into the horrible edge cases where we know it's a little bit harder. But again, I still find people struggle with the simple cases.</p><p><strong>Nick</strong>: So here's why.
And I'm, not just saying it to be smug or anything, but why I think the example you gave me is straightforward because the link between the data request, let's say, and the outcome, the business outcome that the customer is looking to achieve is pretty straightforward, we wanna reduce our customer churn, that means that even roughly, , I'm sure even if their BI is not very good and there's some data quality issues, they'll have some idea of, okay, how many customers are churning today? What is our average order value? What is our transaction frequency?</p><p>And therefore, based off those numbers, I know my customer lifetime value roughly. I know my number of churn customers roughly, and I can go, okay, what if I can reduce that by 10%? What is that worth? Is that worth $500? Is it $5,000? Is it $500,000? For me, that's the starting point.</p><p>And I think the mistake we make in terms of these kind of ROI and value estimations as data professionals is we think it has to be something that, we're gonna calculate with our Python notebook that we need the exact numbers, that it needs to be precise, and it needs to have the same rigor as a lot of the other data work that we do, when actually a lot of the time all you need is a back of the envelope calculation,</p><p>I like using the term Fermi estimation, named after Enrico Fermi, who allegedly just before the Trinity nuclear test. Whipped out on a piece of paper, his estimation of what's the TNT equivalent of the blast that they were about to witness, and he got it. I think it, it was, he estimated 10,000. It was 24,000. I've probably missed the exact numbers. The point is he got the order of magnitude right? He was just off by a factor of two. And for these ROI estimations, it's the same thing. It's I'm not looking to understand is this gonna make us 600 K or 650 K or 700 k?</p><p>It's more, okay, this opportunity, roughly speaking, is in the 500 K to 1 million, and this other one is in the 10 K to 20 K. And this optimization, one of my engineers wants to do, to bring down costs is gonna save us $500 a year. And so if I look at those three examples, it becomes very easy for me to go, okay, roughly speaking, which of these is more valuable?</p><p>Or maybe more valuable per unit effort we're gonna spend. 'cause maybe the optimization will take one day and the trend modeling thing will take, I don't know, one year. And so that, that is a very good starting point. And then the other thing I wanted to comment, 'cause you say, oh, I ask my stakeholders, how do I go about doing this or give us a number and they're like, oh, I don't know. Do you know what makes it so much easier for them to give you a number? If you make that back of the envelope calculation, super basic, you sketch it out on a slide, on an Excel sheet, whatever, and then you show it to them. You're sharing your screen or you show the piece of paper and then they go, oh no, that's not right because this is not our lifetime value.</p><p>Or, oh no, that's not right because whatever other assumption you've made that's wrong. I found that it's so much easier for both technical and non-technical stakeholders to basically critique something or provide me with an input I'm looking for. If I give them the scaffolding, I'm like, here's the shitty first draft.</p><p>Now you tell me what's wrong. Instead of, Hey, here's a blank sheet of paper, please fill it in. 
Even if the blank sheet of paper is like a template for them to fill in, it's still harder to get an answer from them compared to, here's my wrong assumptions. Now, correct them.</p><p>So much easier if I go, Hey, here's the rough Fermi estimation I've done, where I've assumed your lifetime value is this order value times this order frequency, and then that your churn rate is this much and that your growth is gonna be like that.</p><p>And then it becomes real easy for them to start giving me an answer, because you said, oh, I asked them, and how do we quantify that? And they don't know. I think a lot of the time it's that maybe when we ask, especially a non-data-savvy stakeholder, that kind of question, they might think that we're asking for something much bigger, oh, these data guys, they're here to do smart analytics and statistics stuff with their fancy PhDs and computer science degrees and whatnot. It's no, actually, some of these models are super simple. And for me, what I love about this approach is that it also means that it's so easy to do really quickly for a large number of opportunities, right?</p><p>Consider, let's say we've got 15 ideas or 15 requests or even 150. And if you can spend just a few seconds for each one to figure out, okay, roughly speaking, this is gonna save three hours a week from this employee who on average gets paid $50 an hour. So the value of this automation is this much, versus this is a cost saving that'll save this much.</p><p>And sometimes you might be missing one of the variables, right? Like in the churn model, it's okay, will this reduce churn rate by 5%, 10%, 1%? And you can just put a number that feels reasonable, and what I found is usually that rough 30-second estimation is almost exactly the same as the two-week version of that estimation, where we build the prototype churn model and we estimate what it's gonna be when it's in production and we test a bunch of different data sets.</p><p>And then all those times this happened, in that order, I felt a bit silly when my manager may have been like, look, just plug in 10% uplift. And I was like, no, but where did you get that number from? What if it's not 10%? And then I come back two weeks later, 'cause I was insistent that this is not a case where we can just do a Fermi estimation and come back.</p><p>And actually it was pretty much the same number. I was like, damn, okay. Not necessarily wasted two weeks, but wasted two weeks.</p><p><strong>Shane</strong>: I agree with that. I mean, one of the things I say is when we work on the canvases we always get asked for how long is it gonna take to deliver that product. And yeah, typically we are doing this discovery really early. So any detailed estimation is waste because, let's face it, humans are really bad at estimating anyway. But even if we weren't, we're at the discovery stage, we're still ideating, like you said, there could be a hundred ideas going, and spending all that time estimating how long each one of those is gonna take is waste at this level. We haven't prioritized that. We haven't said this is the top five, so let's not worry about it.</p><p>Let's just do a quick t-shirt sizing, pull a number out your bum, the number will actually be quite good 'cause we're good at guessing, and then move on, and we can do more detailed estimates if we have to at a later stage. So I'm with you on that. Do it light, do it quick and use it where the value is. But to do that is quite a skill.
So if I think about it, you have to understand how data works, how effectively metrics works, and potentially metrics trees like you mentioned. And we have to understand the. , Organizational processes. And we have to be able to combine both of those to be able to quickly articulate how lifetime value works or how a churn number will work and what the impact of that is.</p><p>And that is a skill, and it's not a skill that I see a lot. What's your thoughts? Do you see it a lot?</p><p><strong>Nick</strong>: I think for sure being comfortable doing it. It takes some practice, like anything else, and it's gonna feel harder, even at least just mentally. But for me, it comes naturally because I've spent most of my career either in consulting or in data teams where we acted like internal consultants for our different business units.</p><p>So I'm very used to not really knowing very much about a domain, and asking a lot of dumb questions to my stakeholders about, Hey, how does this operation actually work? I draw the flow chart as we go. Maybe I'm even sharing my screen as I'm drawing out a process that a stakeholder's describing and they go, oh no, you forgot about this.</p><p>Or no, I forgot to mention about that step. Or actually, this part is more complicated. 'cause when A happens, we do B, and when C happens, we do D or whatever else. So the point I'm trying to make here is this is not an exercise that a data professional needs to do on their own . It is a collaborative exercise we need to do with our stakeholders for a couple of reasons.</p><p>Number one, because like I alluded to, we don't have the full picture. Even worse, we might think we have it based on the explanation we were given. And actually it turns out our stakeholders didn't mention a whole bunch of other details that were really important. But also, secondly, for me, it's also about building better relationships and trust with our stakeholders, i've seen this happen so many times when, and it's in data or more tech more generally, where the kind of tech team comes, they have some kind of discovery workshop and then they disappear for six months and then they show up with the thing they've built and they go, here you go. Please test it for us.</p><p>And one, what happens then is you often end up with having built the wrong thing. Because again, the picture you got during that initial requirements gathering exercise was incomplete. But also too, let's say you actually nailed it, right? You built exactly what was needed. You estimated the value potential perfectly.</p><p>Then that stakeholder turns around and goes who are you? Who are you to tell me that I should be using this dashboard? Now I've been doing this job the way I've been doing it with Excel for the last 20 years. So it's also about bringing our stakeholders with us on the journey, and that's just as true about the financial estimation parts. That's maybe we need to bring our finance colleagues along the journey, or the client's finance colleagues so that when the business case shows up on their decks, they don't go, who is this? What is this thing that these consultants or that the data team wants to do again?</p><p>They go, oh yeah, this the thing that Shane and I worked on together when I gave them the numbers from the budget to plug into the business case. And then I learned what a, I don't know, random forest algorithm is. 'cause I was curious and it becomes so much easier to work together. 
And the same is true on the user facing side.</p><p>I know it's a slightly different topic, but for me it's actually conceptually the same thing. It's so much better to build together with our customers, be they external or internal, than to do something on our own and then show up and say, here's my homework. And then you find out that the homework is wrong or that they just don't like the font you've used, 'cause that's not the font they're used to.</p><p><strong>Shane</strong>: I agree. I think that constant feedback helps us iterate and figure out where we've heard wrong or where they forgot to tell us something. Or as you said, if you wait six months, something's gonna change anyway. That may have been the most burning problem six months ago, but there's a real big chance that when we go back with the answer six months later, they've moved on.</p><p>They've fixed it with Excel, and there's something else far more important, so it's no longer top of mind. So that idea of constant engagement and collaboration has so much value. And one of the things I think you were just talking about is this back-to-old-school, almost business process mapping, this idea of nodes and links and saying to the organization, how does the process work? Tell me something happens and then what's next? And let me draw a circle and say, this is the thing happening, and here's a line to the next thing happening. And from there we can start identifying those measures that can form that return on investment. And one of the things I think you talked about was looking at it from cost saving, revenue improvement. And there's always a risk one as well, that's the trifecta that you can use: are we gonna save money? Are we gonna make money? Are we gonna reduce risk?</p><p>That's the three that they always tend to come back to in my view. So I think, yeah, combining all those patterns together is really valuable.</p><p><strong>Nick</strong>: A hundred percent. And for me, exactly that trifecta is the starting point of what is the benefit we're trying to influence at the end of the day, because when someone says, Hey, I need a data set showing weekly sales, I'm like, no, that's not what we're here for. That's just a solution you have in mind.</p><p>You're asking the doctor to prescribe you specific medication, and then just like the doctor goes, okay, I understand, you went on WebMD and you've self-diagnosed that you have this, but let's just check your symptoms to be safe. For me it's the same thing, where we go, okay, let's understand the business problem.</p><p>And this is, for me, a super simple thing, but one that so many data professionals get wrong. They see any request that comes from stakeholders as an order, a command. And a lot of the time, one, it absolutely is not like that, and the person making the request is clueless about what they need and they're coming to you for help, but maybe they've also not learned the right way to do that, to say, Hey, our sales are down.</p><p>I need to figure out why. I have a hypothesis that maybe it's because one of the regions is underperforming. That's why my request on the Jira ticket said, give me a sales breakdown by region. But then when they ask that, you can start asking more probing questions to figure out, okay, what is this hypothesis exactly?</p><p>How can we test it? Maybe actually there's an element of statistical know-how that this test is gonna require, that a dashboard with a line chart is not gonna solve for the stakeholder.
Maybe we can test alternative hypotheses in parallel that, if you know how to write code and use different models, are maybe trivial, versus you make the dashboard, it spits out the line chart, the stakeholder looks at it, doesn't see an obvious pattern, or thinks this is gonna be too hard to make sense of, and they give up on it.</p><p>And you've ended up with dashboard number 952 that no one in the business is ever using.</p><p><strong>Shane</strong>: There's a couple of things embedded in there. So one is, if we look at product teams in the software industry or software domain, often they talk about feature factories, where somebody comes and demands a feature and doesn't tell you what it's gonna be used for. Our equivalent is a data request: here's a data request, give me the data, don't ask me what I want it for. I still blame Jira for both, 'cause basically both of those problems are managed in stupid Jira ticketing systems. I think the other thing you mentioned is a doctor saying you've gone and Googled WebMD or whatever. I think we're hitting that new world, right? Where actually our stakeholders are gonna LLM the answer and come to us with a data request that's based on an AI bot telling them what the answer is. So it's gonna get worse before it gets better. But one thing you mentioned was this idea of metric trees, and I'm old, as people mention now that I bring up ghosts of data past, and I remember a while ago we were doing balanced scorecards, cause and effect.</p><p>This idea of saying if we have a bunch of things we measure and we understand the business processes, then effectively we can see some causation or relationships between those metrics, and where we see that causation or those relationships have some value, we can infer some things and help us make better decisions. What's your view on that? Are metric trees another good thing, but just a reincarnation of things we've done before, or is it something different?</p><p><strong>Nick</strong>: I guess, 'cause one, I've not been around this game for as long, and two, I wouldn't call myself someone that's gone super deep into metrics trees and understands them deeply. So I'm probably gonna give you a very lay person's answer. But it's that, yeah, a lot of the core concepts are not new, 'cause for me, fundamentally it's just about saying, I need to understand the relationship between the different inputs in my business and how these translate into outputs, and to basically figure out what is my business's sensitivity for those different things. So for example, let's say I've got three potential initiatives I'm considering.</p><p>One is I wanna grow my mailing list. The other is I wanna improve my checkout rate, and the other is I wanna optimize my pricing to make more profit per sale. If you just tell me those three things, unless if I have a deeply intuitive metrics tree in my head, I'm not as the business owner gonna automatically know which one of these is gonna make the biggest difference for my business, whereas if I have a, forget about the fancy terms, if I just know that flow of, okay, how many people do we have in our mailing list? What is our click-through rate every time we send an email? And then of that, how many people that land on our website actually go and convert?</p><p>And then what is the average profitability of our products? If I have this information somewhere, and if I know those relationships, then I can plug in different assumptions and go, okay, what if I were to increase my mailing list by 10%?
How would that cause a cascade in all the other numbers?</p><p>Okay. If you get 10% more people in the mailing list, because you actually have a super low click-through rate and most of your traffic comes from other sources, your revenue would only go up by 0.1%. Whereas if you were to improve checkout rate, because you're getting traffic from all these other sources, but your basket abandonment rate is quite high, actually, that would translate into quite a bit more revenue.</p><p>Then lastly, if you know the average margin you're making per product, if you were able to increase prices for certain strategic items, actually that on its own would lead to more profit than the other two things combined times 10, made up example. But for me it's more about the high school economics, not even high school, more like middle school mathematics that you might have to do.</p><p>When we did the few exercises around Fermi estimation with my students, as they were doing it, I was like, oh my God, this feels like I'm an elementary school teacher. 'cause I'm just asking these guys to, literally, they're doing multiplication. There's nothing more to it. I've given them all the assumptions and all they have to do is pick which numbers to multiply where.</p><p>And that's how I want the data team to show it to their business stakeholders. Even if there's a lot of complexity behind how we calculate basket abandonment rate or margin or whatever else. At the end of the day, the output, especially the output that a business stakeholder sees, has to be super simple so that they can also understand this whole concept of making data-driven decisions.</p><p>That's ultimately what we're trying to do. I think metrics trees are great because even just visually, they help us understand that relationship between different metrics instead of, here's our metrics report. It has multiple tabs and lots of charts, and you basically can't really figure out how to mentally connect the dots between all these things unless if you have that metrics tree in your head.</p><p><strong>Shane</strong>: What I can see is data people, we love the detail, so as soon as you say cart abandonment rate, I can imagine a data person going, we can't just use an estimate for that, 'cause we know the data's there, so we'll just go and spend three months grabbing it, modeling it, getting the actual abandonment rate.</p><p>And then that'll make our Fermi model so much more accurate when we try and determine the ROI. And I can see that logic, but then it's about time to market. It's about, again, we don't know that this is the most important information or data product to build next. We're trying to use this technique in the discovery ideation phase, not further down, and therefore light touches have value. But do you find that, do you find that data people naturally want to go and grab the data, do a whole lot of work, and make it as accurate as possible?</p><p><strong>Nick</strong>: For sure. And I fall into that trap too sometimes, because I've been among data people for long enough that now I do the same. But look, at the end of the day, there's gonna be some things where actually knowing the real value is quite important.
Either because your estimate might be super off, or because actually if we're doing these small optimization things where we're just trying to increase one thing by 1%, 'cause that 1% is still worth a big number, actually being off by a couple percentage points could make the difference between something being a super profitable investment and something setting money on fire.</p><p>But for me the key thing here about both product thinking in data and building data products instead of data projects, and also metrics trees, which in a way are just a type of, and a collection of, data products, is that we don't go, oh, there's a new project we're doing now. The team needs to go and figure out the basket abandonment rate and figure out all these other metrics, 'cause we're trying to do this one-off kind of project. And even if it's not a one-off project, but it's a specific use case, we shouldn't have to build all these data models. And then you have 10,000 versions of what lifetime value looks like in your business. You go, we will invest a lot of our time as a data team,</p><p>exactly because it is complicated and a few of these things will take many weeks to do, into building out these key metrics. And so we have our collection of metrics that form into a tree, and basically you then go, anytime I want to test a new hypothesis, I wanna explore something, I can rely on these core standard reusable data assets, instead of going, everything is a new project and everything is a new DBT model and everything is a new Tableau dashboard and whatever else. For me, that's where the power of it lies. It's a slightly separate topic to the ROI estimation part. I still think a lot of the time you're better off starting with your Fermi estimates, and then of course if you have the real data to plug into one of those assumptions, great.</p><p>But if you don't, don't wait until you do. Unless if it's a really high-stakes investment, if you are gonna be committing to a seven figure investment in your business, then yeah, maybe back of the envelope isn't good enough to get fully started. But if, as you said earlier, if we're just trying to figure out what are the top four opportunities right now out of the 15 we have, and then once we've zeroed in on the top four based on our rough-and-ready Fermi estimates, then we can do more homework.</p><p>Then we can send the data team to go and do some more calculations if needed.</p><p><strong>Shane</strong>: I can actually imagine that as you do the Fermi calculations, you're gonna find there are certain inputs that you use on a regular basis. Especially if you are working in a specific domain, so let's say the domain boundaries are all based on organizational hierarchy or maybe business process.</p><p>So, we have a sales organization that deals with the sales side of things and they're the highest value part of the organization that we've been asked to help right now, because we are bound within that domain. I can imagine that the inputs we use for the estimate, that ROI calculation, are sometimes gonna be reusable. We're gonna go, ah, maybe it's lifetime value. Maybe it's the funnel and how many people we're converting from suspect to prospect. There's gonna be this thing where we're constantly using that as part of our calculation. So we can say, because we're using it so often, it seems to have value.</p><p>Therefore doing a little bit more work on what the actual number looks like is gonna be really useful for us going forward.
So again, we can use it as a way of figuring out what's the valuable thing to build effectively.</p><p><strong>Nick</strong>: For sure, and I wanna make two comments about this. First is, it's not like a lot of these metrics we are calculating specifically to work out ROI; these are also gonna be the metrics that are part of the data product or related to adjacent data products. If, for example, we don't know our cost to fulfillment, but we're doing a use case that relates to reducing our cost to fulfillment, getting to the precise number of what is our fulfillment cost is also part of the business case. So we start with our rough estimate, 'cause we have an average provided by finance, and then when we wanna break it down by product, that's part of the optimization project. That means that next time round we actually have a data product that is our cost of fulfillment per SKU, as an example.</p><p><strong>Shane</strong>: The other thing I can imagine people wanting to do then is go into the detail to say, if this is the return on investment that we've estimated for that information product, actually we should measure it at the end to see whether we actually did deliver it. And that makes sense. But then we come to a whole lot of complexity, because there's a whole lot of other factors that will influence whether we are getting a reduction in churn or whether we are reducing the cost to serve or the cost to produce. And it's really hard to isolate those other factors, to say it was this one thing we did. Now my view is I don't care. If we do some effort and it looks like it's moving the lever, as long as we keep doing effort and the lever keeps moving, we are always getting to a better place. But how do you find it? Do you find that some organizations or some teams want to then prove the ROI six months down the track?</p><p><strong>Nick</strong>: For sure. And I agree that sometimes the rigor needs to be more than a simple pre-post analysis, especially when you've got dozens of confounding factors or if you're at a place where actually you're carrying out many experiments in parallel. For me, that's one of those things.</p><p>It's, let's cross that bridge when we get to it; we don't need to assume that by default we need to carry out randomized controlled trials and super rigorous A/B tests in order to figure out what's moving the needle. But then the other thing I wanna call out is that everything you've just mentioned is something that typically does not get talked about at the start of one of these initiatives.</p><p>And so the data team, 'cause they've received a request in Jira, they go and fulfill it. There's probably a million back and forths because the request initially that was made, actually, that's not exactly what the stakeholder wanted, but you built it. Now that they've seen it, they realize they wanted something different.</p><p>Anyway, eventually you get to your done state and then you go, oh man, it would be great to know the ROI of this thing we've built for this stakeholder. And then you realize that it's unclear what the definition of success is. You've not done that exercise to go, okay, what is the exact decision that's gonna be made by which person, and how will we know that they're making this decision because of this data?</p><p>And for me it's a little bit like how I'd have classmates in high school who would write their essay first and then go look for citations to prove it. And you know what?
In high school that worked But a little bit later down the line, it, it can't, 'cause you need to build up the essay, so to speak, off the back of citations.</p><p>Unless if you're just making things up. And similarly, I find it's so much easier to actually figure out the ROI of something, if we've done this exercise that goes, what is the business outcome we're gonna be influencing here? Because when you ask that question, you very often realize that, okay, the thing we thought we needed to build actually is a little bit different, or how do we make sure that we embed this decision support information into the decision making process of a stakeholder? How do we then measure or have some even approximate way of knowing when someone took a decision using that data that we gave them, as opposed to they just went with their gut, same as they have been, but they also opened the dashboard just to see what's there, </p><p><strong>Shane</strong>: I was gonna say our best measure at the moment is that somebody actually used it, that the dashboard's not sitting there being unused for six months. The best we typically have is it got used lots. Not that it was used for anything, but it was open and run..</p><p><strong>Nick</strong>: Oh yeah. And for me, even though I think it's useful, especially it's useful to know that something has not been opened, actually the usefulness of that telemetry drops off very rapidly after that. Because you very often have situations where someone is opening the dashboards, but then you don't know exactly what they're doing with it.</p><p>If it's even useful for them. And especially if there's some kind of top-down pressure of, guys, we spent all this money on this new set of mi, , you all need to be accessing it, or it'll affect your performance score. Then easily someone can just open the thing, not even have their eyeballs at it, and then close it a little bit later, or they open it and maybe they actually, viewing time has increased because your dashboard is more complicated and the user's not able to actually get things out of it.</p><p>And for me, the reason I think it's a mistake is because in most businesses, the number of users you have is just not that big, which means you don't need to rely on quantitative cold-hearted information when you can just ask people, you can have one-to-one catchups.</p><p>You can interview people, you can even have a survey form where you ask a few open-ended questions like, Hey, what do you like about it? What do you not like about it? Do you have any ideas for how we can improve it? And use that qualitative data because it's so much richer than, okay, average viewing time has gone up by three seconds.</p><p>What does that mean? How does it relate to the business? Also really importantly, 'cause I feel like whenever people are telling me they're struggling to connect the work they're doing to business outcomes and to ROI, it tends to be bi, because the way a lot of BI is built and structured in most organizations is in a way that I just think is wrong, because it's not actually helping enable better decisions. It's just, Hey, let's build a dashboard about this. It's not, okay. As part of my workflow as an operations manager, we're gonna change it. 
So actually now every Monday morning I open this dashboard, which has these specific KPIs and I make 1, 2, 3 actions off the back of it every week.</p><p>That we've built instrumentation to measure how those actions play out, and so we can have an idea of how those actions improve in quality over time. Most of the time you don't have that kind of mapping. You don't have any measurement of either the action itself or the thing the action is looking to influence.</p><p>I'm taking an action to improve our copy so that next week's newsletter gets a higher click through rate, which means then, yeah, it's literally impossible to measure retrospectively the impact of that dashboard that you built that someone maybe looks at. But it's very unclear what the benefit is.</p><p>It's either unclear because you as the data professional just don't know exactly what they do with it and it's a black box, or it's unclear. 'cause actually there is no real benefit. Or at least it's super, super fluffy.</p><p><strong>Shane</strong>: what would you recommend to an organization, let's say the organization's got to the stage where the data team are engaging with stakeholders. They're thinking in terms of products. They're asking those questions of, with this data, what action are you gonna take?</p><p>And if that action successful, what outcome do you think you're gonna deliver? And then they've led them and done some, whiteboard numbers to say, okay let's quantify this in a really rough way, we think the value to the organization's this. And then they go and build it, and they build it quickly and lots of feedback and it's actually what was needed? How do they deal with that last bit, how do they deal with now looping back and saying, we actually want to know whether it was used, whether it helped that action, whether it helped deliver that outcome.</p><p><strong>Nick</strong>: for me, generally speaking, unless we're talking about a super small, trivial requests that we're just gonna turn around and it's not a big project. I would basically insist that there needs to be a measurement and evaluation work stream as part of the project. To use the analogy I used earlier, we need to have a kind of research and citation work stream happening in parallel as part of the project.</p><p>In the same way that there's gonna be a sort of scoping phase design. We're gonna get approval for the wire frame, then in parallel. We're building the data model at the same time as building the dashboard. At some point in that parallel stage, we're also defining what are the metrics for success.</p><p>And if those metrics are not, based on numbers that are readily available because we've got our metrics tree because it's part of another, report, an mi, a dashboard, then we go, okay, now we need to set up some kind of measurement for this new thing that is gonna be part of the success metric.</p><p>And then as part of that, we're probably gonna build a second dashboard. That's gonna be measuring that effectiveness, i'll give you an example from a project we were doing a few years ago with a container terminal. 
Really complex operation where actually anything you try to change is gonna have knock on consequences on many other parts of the operation.</p><p>But after doing a combination of qualitative and quantitative analysis, we're like, okay, one of the things that we can dramatically improve for the overall productivity of the port and the north star metric of the port, which is the productivity of the big cranes loading and unloading vessels as they come in, is actually if we can optimize the way we allocate drivers in the trucks that can have a big impact on the overall productivity of the port.</p><p>So in that case, there were some metrics that we had already, like the big crane productivity, that was a metric that was firmly established in the organization 'cause they reported on it constantly. Then there were some truck driver productivity related metrics that basically didn't exist in the existing MI suite.</p><p>So it was like, okay, if we're gonna build this product to optimize driver allocations, then we also need to add something to the management information that we produce so that we can be measuring the effectiveness of this model, so we can see is it actually delivering the value we were expecting it to deliver?</p><p>And for me this points to a really important problem that a lot of organizations face, which is it's not just that, oh, if you have more data that's better and you can make more decisions, you need to have the right data. And it's why I really hate the kind of big data, data lake approach of, oh, let's just dump all the data we have.</p><p>Then the data scientists or the data miners will find value out of it. 'cause then usually what happens is when you start a project and you look at what's in the data lake, you realize that actually the super important column for the model you're trying to build, actually it doesn't make its way into the data lake.</p><p>Or it never existed in the source system in the first place. And so when we're deliberate about the business problem we're trying to solve, we can also be deliberate about what are the data points we will need and do they exist? And if they don't exist as part of this project, we need to create those data points.</p><p>It's not a nice to have it's a must have condition for the success of the project.</p><p><strong>Shane</strong>: then I'm gonna jump naturally to the next step, which is, okay, so let's say that it's optimized for the big crane and therefore we are going to affect the optimization or the workflow of the drivers. I'm naturally then gonna want to build out , that metric first so that I can benchmark the current state. So then when we make the changes, we can then see have we had a positive or negative impact? And then I can see how that would then potentially delay the production of the information product that has value, because now I'm spending time getting ready for the benchmarking data before I then go and build the thing that actually is gonna make the change. Is that a natural trap to fall into or is that something that If you can, you should do.</p><p><strong>Nick</strong>: Obviously we're talking about it hypothetically, and it's gonna be a little bit different in each organization, but even in this hypothetical, the two objections I'd have to, that criticism would be one. Okay if this is gonna be a project that'll last several weeks, let's just sequence it.</p><p>So that we start collecting this data from the very start of the project. 
We make it the first thing we do, not the last thing we do, actually, we've done this a bunch of times where we said, you know what? You guys aren't collecting this data 'cause it gets deleted at the end of the day. But you know what, if we start collecting it now, by the time we will need to use it for the model we're building, we will have just enough data.</p><p>So that's one thing. And in some cases maybe this will mean, yeah, the project will take longer, but if that's what's needed, then that's what's needed. It's the kind of objection that if it was a let's say a technical objection to do with the core of what's being built, it would be so much easier mentally to say, look, this is a must have, right?</p><p>I cannot build a model without this data. Whereas it's very tempting to say, okay, fine. I guess we can build the model without also collecting success metrics. Yeah, maybe you can get away with it once or twice. Orric. Actually, if you're systematically doing that, yeah, you're systematically not gonna know the benefit of what you're doing.</p><p>Which for me,, it's so weird because the whole point of having a data team is to make an organization more data-driven. And what being data-driven means to me is basically being evidence-driven. We're not making decisions because of just gut feel or copying our competition blindly, but we're using evidence.</p><p>But then we are so bad at basically following our own advice and using data to inform what on earth should we be doing and is it worth doing? So that, that's the kind of first objection. The second objection is you probably should be collecting that data anyway, same as what we were talking about with metrics, trees, the things that are gonna be your success factors.</p><p>Not always, sometimes they will be super specific to a project, but other times they're gonna be core metrics that you should start tracking anyway because there's gonna be other initiatives later down the line that we should be doing, but. I would rather let's say, collect that metric about the truck driver productivity as part of this project.</p><p>That is a specific optimization we're trying to do instead of do. The other thing data teams often say is, which is, oh, we need to invest in data foundations. Oh, let's build out all the metrics first so that then we can track success later. For me that's bad for , two reasons. Number one, maybe you will track the right success metrics or the right metrics, or maybe when you get to your new use case, you realize this wasn't the right thing anyway.</p><p>But also, secondly, you are proposing building something that delivers no inherent value on its own. Which is all too common. All too common that we spend two years doing a big migration promising the business that after the big migration we'll finally be able to deliver value. And what do you know, two years later, CDO gets fired.</p><p>New CDO comes in first thing they do, they wanna rebuild the platform and promise again that the results will come two years later. So for me, it's also how can we bundle any kind of, let's call it technical debt or platform investment or foundational investment together with something that's gonna deliver value to the business,</p><p>and delivering minimum increments of value. 
Not, oh, let's do the platform stuff first and the value-making stuff later when we've left the company and it's someone else's problem.</p><p><strong>Shane</strong>: And got the new tools on our CV and got promoted in the next job, 'cause that two years was really great fun. Got paid a fortune and got a better job. We used to blame Waterfall for big requirements up front and foundational builds.</p><p>And then lots of organizations went down the agile path and we started seeing six-month sprint zeros, where again, it was just pure foundational build. No engagement with stakeholders, no value. We do have to balance it out against ad hoc behavior where we don't do any foundational work. So there is that horrible balance of building the airplane while you're flying it, that is the balance. But it is a balance, it's about the context.</p><p><strong>Nick</strong>: I agree. I'm not advocating for basically being the clueless business person that does not understand what the engineers are on about when they talk about technical debt and just wants, ship me these features, 'cause they're gonna make us money. But we do need to be intentional about it.</p><p>Any piece of technical debt for me is not inherently bad. It's debt I have borrowed from our future selves, from our future data team, in order to, usually, test something. There's no point building something super robust and scalable if we don't know it's worth scaling in the first place.</p><p>But also maybe it's because tactically actually it's gonna really help if we can use this to provide some results for the business before the end of the quarter, either because there's a financial upside or just because it's gonna help us get more trust with them so then they can spend more time with us, involve us more early on in decisions around what we should do and all that good stuff.</p><p><strong>Shane</strong>: One of the things that naturally people want to do is take those outcomes and those ROI statements and link it back to the corporate strategy. There's a strategy PowerPoint somewhere with four boxes, or there's the top 11, 15, 25 initiatives that are the most important in the organization. And it's easy to gamify, it's really easy to say, these metrics support the strategy. It's actually damn hard to find metrics that don't support your strategy in some way using creative language. Do you ever worry about it? Do you ever bother saying that these ROI statements, these metrics line up with the corporate strategy and the initiatives, or do you just ignore it?</p><p><strong>Nick</strong>: Because I'm not advocating for just, oh, we should link the initiatives we're doing to some other broader business initiative called, I don't know, be loved by our customers or get trusted by our partners, whatever. For me, that's fluff.</p><p>Sometimes it's very useful to link it, maybe so it falls under the right OKR or so that it gets the attention of the right people. But at the end of the day, most things, you should be able to express their value in financial terms. It's either money in the bank, we have generated incremental revenue, or we have saved costs that definitely would've accrued otherwise.</p><p>In other cases, it's a little bit fluffier, but still has a financial number next to it. 
If we build an AI chatbot that helps our employees get answers to questions about HR policies faster, it's not necessarily that, okay, we've saved 10 hours a week from our HR employee, and that's money in the bank, because we're still paying that HR employee full time.</p><p>Now, if you have a massive organization that actually goes, no, we will downsize that team, that will be money in the bank. But in other cases, you can just quantify the productivity saving. You go, we have saved the legal department 15 hours a week times 20 employees paid a hundred dollars an hour, and you go, okay, that's not money in the bank, but that is the value of the productivity that the legal team now can redeploy elsewhere. And for me, it's not as good as money in the bank, but sometimes that's necessarily where our estimate stops. But that estimation is still so much better than just going, we are making the company more productive and AI-ready.</p><p><strong>Shane</strong>: Then do you ever bring in the cost side of the data team? One of the things people often say to me is, the data team are trying to do the right thing, but nobody from the organization will engage. The request gets handed across, they want somebody from that part of the organization to work with them as a subject matter expert, whatever. And I often say to them, have you worked out what the data team costs every week? And have you gone back and said, hey, this is four weeks' worth of build. That's a hundred thousand dollars of time. Is it worth a hundred thousand dollars to the organization? And often data teams don't do that, they don't quantify their own costs. Do you do that as part of understanding the ROI or do you not worry about it as much?</p><p><strong>Nick</strong>: Of course, 'cause ROI is not just what is the value, it's return on the investment. So the I part is basically the cost. And here it can get complicated because what are the fixed costs that we're just assuming are sunk? Very basic example, will I consider the Microsoft Teams license of each of my data scientists as part of the investment?</p><p>Or do I just go, look, this is a fixed cost the business has made and we're talking about incremental return on investment, meaning how many additional hours of a data scientist's time are we gonna spend? Or maybe if we need to hire contractors, it's about the cost of those contractors, or it's about the incremental compute that this will cause, because we're now gonna have to pay Databricks a bunch more money because we're building all these new models. So that takes some finessing, I think.</p><p>And it basically requires you to understand what, I don't wanna say what you can get away with in your organization, but more how does your organization think? Are there some costs that ultimately are subsumed in some centralized budget? And so, one, you'd be doing yourself a disservice if you included those, because the rest of the business isn't.</p><p>And two, it would maybe skew the picture of whether this is a worthwhile activity. For me, I also think the thing that doesn't really work as effectively is to just turn around to the business and go, this request is gonna cost a hundred K, 'cause usually it's not like they will pay for that. If you have some kind of cross-charge model, or you're like, look, we would have to hire new people to support this new initiative,</p><p>and so we need to work together to make the business case for finance to give us this additional budget, 
then you can definitely do it and it makes sense, and it potentially will come out of the marketing team's budget, not out of yours. Again, depends how your organization does budgeting. But for me, I think of it more in terms of, if I'm running the data team and I know that cost, I know how much it's costing us to keep the lights on every week.</p><p>First of all, I wanna make sure that at the very least we are delivering more than that, like a multiple. If we're costing a hundred K a week, then any given week we should be delivering at least 110, if not 200 K, back to the business. And to be reporting that upwards on a regular basis, 'cause it's sometimes also easy to forget.</p><p>And if it's not about that, if it's not about the kind of holistic picture of what the data team is doing, at the very least it's using, even without the cost part, the value part to figure out, okay, my team could work on any of these three things for the next two months, which one is worth doing?</p><p>And then using that language, not just to decide internally amongst ourselves which use case are we gonna push for, but also if that involves telling two stakeholders that their baby's gonna get deprioritized because we need to work on the other use case, point them to those financial numbers.</p><p>Be like, look, your thing is gonna cost the business 200 K, but based on our projections, it's only gonna make us an extra hundred K. And the way I usually frame it is I go, look, if you can find a way to get acceptance for that kind of thing, maybe we can do it. But it's really hard for me to either deprioritize someone else's 10x ROI initiative that they've put in front of us, or to justify it to my boss that, hey, we're gonna be costing the business money because Bob really wants us to build churn prediction model version 57, 'cause version 56 wasn't good enough.</p><p><strong>Shane</strong>: That comes back to that really interesting question of who does the prioritization. Because often in an organization, it's a committee, it's a group of stakeholders, whoever's got the biggest voice or whoever's the HiPPO gets to choose, and it gets given to the data team. Whereas if we have this idea of a data product manager, or data product leader or whatever you wanna call them, then often they are the ones that should be making the final decision, they should be saying, okay, based on the value, these are the ones that are the most valuable to the organization.</p><p>So this is the one we're gonna build. But often data teams don't work that way. Again, it comes back to, they get given a data request and somebody outside the team effectively prioritizes and tells them what the next most valuable thing is, but that person's not held to account for the value.</p><p><strong>Nick</strong>: I think my answer as to which model works best also depends on the organization and the politics. Who controls the budgets, the power, what happens when, but in general, I'm against either extreme. I'm definitely against the business has decided the priorities and they just chuck them to the data team, who have data product owners whose job is actually to just decide maybe the exact sequence of delivering these things.</p><p>And there's no product work being done there. We can rant about SAFe and scaled agile another day. But then secondly, I think it's also not good, especially if it's framed as the data PM is the final decision maker. 
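</p><p>A minimal, illustrative Python sketch of the value-versus-cost comparison described above, for stacking candidate initiatives against each other; the initiative names, costs and projected values are assumptions made up for the example, not figures from any real backlog.</p><pre><code># Illustrative only: comparing candidate initiatives by projected return versus
# estimated cost, as one input into a prioritisation conversation.
# All names and numbers below are made-up assumptions for the example.

initiatives = [
    {"name": "Churn model v57", "est_cost": 200_000, "proj_value": 100_000},
    {"name": "Driver allocation optimisation", "est_cost": 300_000, "proj_value": 900_000},
    {"name": "Pricing decision support", "est_cost": 500_000, "proj_value": 9_000_000},
]

for item in initiatives:
    # Net value and a simple ROI multiple, per initiative.
    item["net_value"] = item["proj_value"] - item["est_cost"]
    item["roi"] = item["proj_value"] / item["est_cost"]

# Highest projected net value first: a starting point for the conversation
# with stakeholders, not a substitute for it.
for item in sorted(initiatives, key=lambda i: i["net_value"], reverse=True):
    print(f"{item['name']}: net {item['net_value']:,}, ROI {item['roi']:.1f}x")
</code></pre><p>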
And for me it's fine if maybe me as the data PM, I've got all these requests from different departments and I do that prioritization, but then I wanna involve my stakeholders in that.</p><p>And here it gets a bit murky. It's not, I'm gonna give everyone a vote. I'll be like, look guys, we've got these five initiatives and they all have a different number of zeros attached to them in terms of what they're gonna bring to the business. And so maybe if Bob says, look, I know mine is only a six-digit opportunity compared to everyone else's seven-digit opportunity, but actually for whatever, let's say, strategic reason, or because it might cause bigger problems later down the line, it's more important, then there it's gonna be a balance. Sometimes we're gonna be more consensus-driven, depending on how the organization works.</p><p>And other times I go, look, Bob, if this is so important, basically you need to escalate upwards because my hands are tied. I cannot put your hundred K thing in front of someone else's 9 million opportunity, assuming they're gonna take a similar amount of time anyway.</p><p><strong>Shane</strong>: I think we see that in the software product domain. We see that product managers often are reliant on influence to get the right things built, and that's effectively their job. I find it interesting that if I look at other shared service type parts of an organization, so if I look at HR and I look at finance, they are a cost, and yet they very rarely have to do ROI statements for the value they deliver. And for some reason data does. And my hypothesis is it's because data typically came out of the IT side of the org, and that always seems to be somehow having to justify what they're spending their time on more than those shared service organizations.</p><p>What's your view on that? Is that what you see, or do you not see that?</p><p><strong>Nick</strong>: Great question. The kind of thing that comes to mind first is actually the fact that we see, for example, HR as a cost center, and often IT as a cost center too, especially the more basic IT functions, is a problem. It's a problem because it means that a lot of things become invisible to the business, the fact that we see our expense tracking software as just a fixed cost, it's a cost center, it's the cost of doing business. And that's why it has selected the most competitive offer in terms of which horrible expense software we're gonna use. And what doesn't happen then, because it doesn't have to prove the value or the return of using that software, is there's no explanation of opportunity cost. We're not looking at, for example, okay, on average, because we're using our shitty home-built software from 20 years ago that our employees need to use Internet Explorer to submit their expenses with, instead of just giving Concur some money and using that instead, or something even newer than Concur, that wastes on average five hours a month from 20% of our employees that do lots of business travel.</p><p>And when we add up the total cost of that, actually it's a multiple of the cost of using the fancier software-as-a-service expense software that it doesn't wanna use. So my first challenge is actually, I think a lot of teams and departments are seen as cost centers when actually that's not right either.</p><p>Because we're not evaluating opportunity costs, we're just evaluating costs in a vacuum. But then secondly, I'd say, I think it's fine and good for data to have this slightly unfair expectation of proving our value. 
The reason is, and maybe you tell me, look, by and large, we don't need to think about the ROI of the laptops we have and of using Microsoft Office, because it's well established, it's commoditized.</p><p>We know that we just need to have it in our business. We just need to have some HR information software. We just need to have a payroll provider. And for some of those actually, yeah, cheapest is best. It doesn't make a big difference if we're using the newer, fancier payroll provider, its job is to just get money from A to B. In data, that's less the case. Partly because we're a more nascent profession. We haven't figured out how to professionalize our industry, if that's even the right thing to do. There's a lot more question marks. But also, secondly, we're not just providing BAU services to the business.</p><p>We're helping innovate. Or at least, for me, the more value-additive things a data team can do, it's not, we're gonna churn out BI dashboards showing metrics for the sake of metrics. It's, we will help the business use data to improve the quality of decisions, to potentially unlock new streams of revenue, to build new products.</p><p>Now with AI and the fact that you basically need your data function to build a lot of the infrastructure, if you wanna have an AI feature in your product, that's becoming more obvious. But it was true before that as well: machine learning could be used not just to recommend a decision in a BI dashboard that someone can ignore, but to actually automate a certain part of the process, to automatically spit out recommended content for someone looking to watch a movie or looking to buy something from an e-commerce store or whatever it might be. I don't want the data team to act like a cost center, because then we're just gonna default to doing bare minimum, low-value-adding tasks that can be commoditized and expressed in the form of ServiceNow or Jira analytics. And maybe we need to do a little bit of that, but if that's our main focus, I'm gonna change careers.</p><p><strong>Shane</strong>: Yeah, you made me giggle on the expense one. I remember there were probably two examples actually. One was somebody working in the defense forces that could go out and spend billions of dollars on new tanks, but then had to go through seven layers of approval in the expense system to get their parking ticket, the cost to park their car, paid. And then somebody that was working for a software company and used to do seven-digit sales, and again, same thing, would spend four hours of their own personal time on the weekend going through the expense system to claim back the coffees. It's just amazing when you think about the cost of that. Alright, just looking at time, I just want to close it out with a question around what makes a good product manager, and let me frame it in a certain way. Back in the day when I was working with teams that were trying to adopt Agile, and were primarily going down the Scrum path, so picking up the patterns from Scrum, and yeah, good point on SAFe. We probably don't have time for me to rant enough about that one, but in Scrum one of the roles that was common was the idea of a scrum master. And what I found was business analysts, for some reason, naturally did the scrum master role. If I was looking for somebody to come in as a new person into that role and somebody had a business analyst background, I found they approached that role in a certain way that made them successful in the team. 
Whereas in my experience, and this is my experience, if somebody came in from a fairly long project management background, they didn't.</p><p>After a while I saw the pattern, and the pattern really was a project manager wanted to stand in front of the team, and a business analyst was happy to stand at the back. But what have you found if you were looking and saying, there is a natural set of skills or a natural role that lends itself to adopting the role of a data product manager? What have you seen? Where would that be?</p><p><strong>Nick</strong>: I've thought about this quite a bit. I'd say it's two things that we can call aptitudes, but I think they're learnable as well. It's not, oh, inherently you need to be like this. It's not about are you technical, are you not, it's not, have you worked in data? It's not, have you worked in product before?</p><p>To be a good data product manager, and also a product manager in general, I think two things make the most difference. Ownership is number one. It basically means are you invested in the outcomes you're trying to enable? Or do you just go, my job description said I needed to do this task and I've done it, so it's not my problem the product is still not making money for the business.</p><p>That sense of ownership has a huge impact on how you go about executing in your role. And product managers often have to basically fill in the gaps. And in one team, that gap might be more on the technical leadership side. On another one it might be to do with helping marketing, or maybe you're doing the marketing 'cause there's no one to support you there.</p><p>And in another one, you need to do financial analysis to figure out the right pricing or to work out the cost. So you need to be fluid. And the best way for someone to embrace that is to have that sense of ownership, because they're invested in the outcome, not just thinking of their role in terms of processes and outputs, similar to those project management types that you mentioned.</p><p>And then the second thing is a sense of curiosity. Because again, every product is different, every team is different. You need to have and cultivate a sense of curiosity to learn more about your users, about your business, about the technical underpinnings of your product, about what the data actually shows and means.</p><p>Because otherwise you're gonna end up being either the kind of PM persona that doesn't understand the tech at all, doesn't join the engineers, delegates it fully, and they just become an information sifter. They go to their stakeholders and they're like, here's the copy-paste answer the engineering team gave me about why the product doesn't work as you'd like.</p><p>Instead of, hey, because I understand where your feedback is coming from, I've been working with the engineering team for us to figure out how we need to rebuild this for it to make sense. 
And it's also the same in terms of understanding your stakeholders. If you don't have a sense of curiosity to be like, tell me more and let me understand this problem,</p><p>you're never gonna build up that picture of the business and the pain points that you need to connect the dots and do more than just turn over requests that are coming reactively to you. In order for you to be proactive, and in order for you to be able to actually propose initiatives to the business that are gonna make a strategic impact, you need to have that curiosity to understand what is it that we're doing, both on the business side and the tech side.</p><p>And for me, those are muscles that you can exercise or that you can let atrophy. And so what I always recommend to people that are either in the role or thinking about the role is, how can you start demonstrating those things, not just to say in a job interview that, hey, I've done them, but to be well placed to actually do a good job when you get the job.</p><p><strong>Shane</strong>: I think one of the things you said there was, for me, one of the really important ones: be inquisitive around what the problem is. Understand the problem itself before you worry about the solution. Because otherwise the solution may not solve the problem. It may be a great solution, but it may not solve the problem. Which again comes from that product thinking of understanding the problem the customer has, maybe jobs to be done.</p><p>I do like the jobs to be done stuff. It helps articulate things in a way that makes me think differently, and therefore I've gotta keep challenging myself to, well, actually what is the job to be done? What is the problem to be solved? So again, just keep asking yourself that, and if you can't articulate it, that means you don't understand it. Which means if you're stuck, which you are, between organizational stakeholders and the data team, you've gotta understand both sides, and that's a good way of thinking about it. </p><p>Excellent. If people want to get a hold of you, what's the best way to find you?</p><p>See what you're doing, what you're writing. Get in touch.</p><p><strong>Nick</strong>: Best place is probably LinkedIn. You can look me up. There's only two people with my name out there, and the other one does not do anything related to data. I also write some longer-form articles on Substack, but all that stuff is linked on my LinkedIn, where I spend most of my time kind of writing shorter-form things. </p><p><strong>Shane</strong>: Did you start out on Substack or did you start in Medium and jump to Substack?</p><p><strong>Nick</strong>: No, I started out on Substack. Before that I had a personal blog that was not data related at all.</p><p><strong>Shane</strong>: It's interesting that, kind of, LinkedIn and Substack seem to be the default now for people that are sharing their knowledge in the data space. And, I don't know about you, we find both of them lacking in certain areas. So it's gonna be interesting to see if a new product turns up that takes over both the networking capability and the content sharing.</p><p>We'll see.</p><p><strong>Nick</strong>: I just keep getting Chrome extension ideas for how to make both of them suck less, and I might start building out some of them just to help my own problems. I'm a big fan of both. </p><p><strong>Shane</strong>: Well, if you can solve that one for me, something that sucks less on both of them would be something I'd buy as well. So look forward to that one. Excellent. All right. Thanks for your time. 
, That's been good in terms of articulating some clear patterns around ROI, things that are relatively simple. As one cicada says, don't boil the ocean. I think that's one of the key messages that I got was don't overthink it, do enough that it's useful and then just keep getting better and better at it.</p><p><strong>Nick</strong>: Thanks for having me, Shane. Thanks for listening to all my ramblings and yeah, a hundred percent. Keep it simple. It's easy to do, complicated. The hard thing is to simplify.</p><p><strong>Shane</strong>: It's been a great chat. Thank you for that, and I hope everybody has a simply</p><h2>&#171;oo&#187;</h2><div class="pullquote"><p><em>Stakeholder - &#8220;Thats not what I wanted!&#8221; <br>Data Team - &#8220;But thats what you asked for!&#8221;</em></p></div><p>Struggling to gather data requirements and constantly hearing the conversation above?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Bu2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" width="387" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:387,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19725,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/160520537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Want to learn how to capture data and information requirements in a repeatable way so stakeholders love them and data teams can build from them, by using the Information Product Canvas.</p><p>Have I got the book for you!</p><p>Start your journey to a new Agile Data Way of Working.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://adiwow.com/168&quot;,&quot;text&quot;:&quot;Buy the Agile Data Guide now!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://adiwow.com/168"><span>Buy the Agile Data Guide now!</span></a></p><h2>&#171;oo&#187;</h2>]]></content:encoded></item><item><title><![CDATA[Dimensional Data Modeling Patterns with Johnny Winter]]></title><description><![CDATA[AgileData Podcast #73]]></description><link>https://agiledata.info/p/dimensional-data-modeling-patterns</link><guid isPermaLink="false">https://agiledata.info/p/dimensional-data-modeling-patterns</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Mon, 18 Aug 2025 19:55:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/eOPGc1F9lH4" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Johnny Winter about the core patterns that make up Dimensional (Star Schema) Modeling.</p><blockquote><p><strong><a href="https://agiledata.substack.com/i/171239242/listen">Listen</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/171239242/google-notebooklm-mindmap">View MindMap</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/171239242/google-notebooklm-briefing">Read AI Summary</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/171239242/transcript">Read Transcript</a></strong></p></blockquote><p></p><h2>Listen</h2><p>Listen on all good podcast hosts or over at:</p><p><a href="https://podcast.agiledata.io/e/dimensional-data-modeling-patterns-with-johnny-winter-episode-73/">https://podcast.agiledata.io/e/dimensional-data-modeling-patterns-with-johnny-winter-episode-73/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://podcast.agiledata.io/e/dimensional-data-modeling-patterns-with-johnny-winter-episode-73/&quot;,&quot;text&quot;:&quot;Listen to the Agile Data Podcast Episode&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://podcast.agiledata.io/e/dimensional-data-modeling-patterns-with-johnny-winter-episode-73/"><span>Listen to the Agile Data Podcast Episode</span></a></p><blockquote><p><strong>Subscribe:</strong> <a href="https://podcasts.apple.com/nz/podcast/agiledata/id1456820781">Apple Podcast</a> | <a href="https://open.spotify.com/show/4wiQWj055HchKMxmYSKRIj">Spotify</a> | <a href="https://www.google.com/podcasts?feed=aHR0cHM6Ly9wb2RjYXN0LmFnaWxlZGF0YS5pby9mZWVkLnhtbA%3D%3D">Google Podcast </a>| <a href="https://music.amazon.com/podcasts/add0fc3f-ee5c-4227-bd28-35144d1bd9a6">Amazon Audible</a> | <a href="https://tunein.com/podcasts/Technology-Podcasts/AgileBI-p1214546/">TuneIn</a> | <a href="https://iheart.com/podcast/96630976">iHeartRadio</a> | <a href="https://player.fm/series/3347067">PlayerFM</a> | <a href="https://www.listennotes.com/podcasts/agiledata-agiledata-8ADKjli_fGx/">Listen Notes</a> | <a href="https://www.podchaser.com/podcasts/agiledata-822089">Podchaser</a> | <a href="https://www.deezer.com/en/show/5294327">Deezer</a> | <a href="https://podcastaddict.com/podcast/agiledata/4554760">Podcast Addict</a> |</p></blockquote><div id="youtube2-eOPGc1F9lH4" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;eOPGc1F9lH4&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/eOPGc1F9lH4?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? 
The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><h2>Google NotebookLM Mindmap </h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dsH3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0111cf5c-a0d4-4eab-acce-3d471c6e7158_8253x26836.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dsH3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0111cf5c-a0d4-4eab-acce-3d471c6e7158_8253x26836.png 424w, https://substackcdn.com/image/fetch/$s_!dsH3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0111cf5c-a0d4-4eab-acce-3d471c6e7158_8253x26836.png 848w, https://substackcdn.com/image/fetch/$s_!dsH3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0111cf5c-a0d4-4eab-acce-3d471c6e7158_8253x26836.png 1272w, https://substackcdn.com/image/fetch/$s_!dsH3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0111cf5c-a0d4-4eab-acce-3d471c6e7158_8253x26836.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dsH3!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0111cf5c-a0d4-4eab-acce-3d471c6e7158_8253x26836.png" width="1200" height="3901.6483516483518" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0111cf5c-a0d4-4eab-acce-3d471c6e7158_8253x26836.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:4734,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:11741457,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/171239242?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0111cf5c-a0d4-4eab-acce-3d471c6e7158_8253x26836.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dsH3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0111cf5c-a0d4-4eab-acce-3d471c6e7158_8253x26836.png 424w, https://substackcdn.com/image/fetch/$s_!dsH3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0111cf5c-a0d4-4eab-acce-3d471c6e7158_8253x26836.png 848w, 
https://substackcdn.com/image/fetch/$s_!dsH3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0111cf5c-a0d4-4eab-acce-3d471c6e7158_8253x26836.png 1272w, https://substackcdn.com/image/fetch/$s_!dsH3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0111cf5c-a0d4-4eab-acce-3d471c6e7158_8253x26836.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>Google NoteBookLM Briefing</h2><h3>Briefing Document: Dimensional Data Modelling Patterns</h3><p><strong>Overview:</strong> This podcast episode of "Agile Data" features Shane Gibson and Johnny Winter discussing the enduring relevance and practical applications of dimensional data modelling. Johnny Winter, a seasoned data consultant with a background spanning from Crystal Reports to modern data stacks, provides a comprehensive breakdown of core dimensional modelling concepts, common patterns, and nuances often overlooked. The discussion highlights why dimensional modelling remains the "number one analytical modelling technique in the world," even with the advent of new technologies and approaches like data vaults and activity schemas.</p><p><strong>Key Themes &amp; Important Ideas:</strong></p><p><strong>1. The Enduring Relevance of Dimensional Modelling:</strong></p><ul><li><p>Despite its age (Johnny started his career when it was prevalent), dimensional modelling is "pretty much still the number one analytical modeling technique in the world."</p></li><li><p>Its popularity has seen a resurgence with tools like DBT, indicating its continued applicability in modern data stacks.</p></li><li><p>The widespread availability of resources, particularly Ralph Kimball's books and blog posts, has made it highly accessible and understandable: "Kimble and Margie Ross they wrote a hell of a lot They write lots of books that were easy to read and understand."</p></li></ul><p><strong>2. 
Core Concepts: Facts and Dimensions:</strong></p><ul><li><p><strong>Dimensional modelling</strong> organises data into two primary categories: <strong>facts</strong> and <strong>dimensions</strong>.</p></li><li><p><strong>Facts</strong> are the "measurements, the things you actually potentially want to aggregate or trend," often representing "the event." Johnny uses the "how many" from Lawrence Core's 7 W's as a fact. Examples include order value, payment value, or sick days.</p></li><li><p><strong>Dimensions</strong> provide "the context that you apply to those," enabling "slicing and dicing of the data." These relate to the "who, what, when, why, where" of an event. Examples include customer, supplier, employee, or store.</p></li><li><p>Historically, this separation was driven by "constraints" (performance and cost) in databases, but it still promotes "reuse reasons" today.</p></li></ul><p><strong>3. Grain of Facts:</strong></p><ul><li><p><strong>Grain</strong> refers to the "level of detail" in a fact table.</p></li><li><p>Johnny advocates for modelling at the "atomic grain" or "lowest grain anyway." He states, "You can always roll up grain easily in a query... but it's very difficult to do the other way around."</p></li><li><p>While historical constraints led to "aggregated facts" for performance, modern "column store databases and now storage formats like paret" make aggregating granular data much easier and reduce storage footprint.</p></li><li><p>The default should be the lowest possible grain (e.g., "order line" rather than "order") unless there's a specific, justified reason for aggregation (e.g., performance, ease of use).</p></li></ul><p><strong>4. Slowly Changing Dimensions (SCDs):</strong></p><ul><li><p>SCDs address how changes in dimension attributes are managed over time, allowing for "asis and as was type reporting."</p></li><li><p><strong>SCD Type 0:</strong> The attribute "never ever changes." (e.g., a time dimension with 24 hours in a day).</p></li><li><p><strong>SCD Type 1:</strong> The attribute gets "overwritten." Historical records are updated to reflect the current value, meaning "all of my historical records will now point to that value." This can obscure historical analysis.</p></li><li><p><strong>SCD Type 2:</strong> "Allows you to track changes over time." A new record is created for each change, preserving historical context. This typically involves "valid from valid to date or a start date and end date" and a "surrogate key" to uniquely identify each version of the business entity. Johnny notes this is generally preferred now, though some clients still opt for Type 1 for "historical reporting perspective."</p></li><li><p><strong>SCD Type 3:</strong> "History recording except for rather than with a type two where you get an extra record when the value changes you get an additional column." It only tracks the previous version.</p></li><li><p><strong>SCD Type 6 &amp; 7:</strong> Hybrids, often involving "durable keys" for more complex scenarios, but "most people do ones or twos."</p></li><li><p><strong>Key Management:</strong> The importance of <strong>surrogate keys</strong> is highlighted, especially for Type 2 dimensions, as they provide a unique identifier for each record in the data warehouse, abstracting from the business key which might not be unique across historical records. 
While "hash keys" or "concatenated business keys" are possible with modern tech, Johnny prefers surrogate keys as they "forces you to go back and look at your dimensions and make sure that your values exist."</p></li><li><p><strong>End-dating strategies:</strong> The choice between leaving an end date as NULL or using a 99999 value, and the concept of "windowing" (which emerged from Hadoop's insert-only preference) versus direct end-dating, are implementation-level patterns that require careful consideration based on technology and cost (e.g., BigQuery's cost optimisation for end-dating). Consistency in these patterns across a data platform is crucial.</p></li></ul><p><strong>5. Fact Table Types:</strong></p><ul><li><p><strong>Transactional Fact Table:</strong> "Pretty much append only," recording a single event as it happens.</p></li><li><p><strong>Accumulating Snapshot Fact Table:</strong> Used for processes with multiple defined steps. A single record is updated as the process progresses (e.g., tracking a podcast booking process from invitation to confirmation).</p></li><li><p><strong>Snapshot Fact Table:</strong> A "full data dump every single day," capturing data at a specific point in time (e.g., stock balances, month-end closing balances).</p></li><li><p><strong>Type 145 Fact Table (Accumulating Time Span Snapshot):</strong> Less commonly mentioned, this "process driven type fact table whereby the status of something changes and you get a new record every single time." It's like a Type 2 SCD applied to a fact table, excellent for tracking states in a process like a sales funnel. Johnny calls the common misnomer "SCD Type 2 Fact Table" an "oxymoron."</p></li></ul><p><strong>6. Dimensional Design Patterns (Conformed, Role-Playing, Junk, Degenerate):</strong></p><ul><li><p><strong>Conformed Dimensions:</strong> "Reuse the same dimension across multiple facts." This promotes consistency and reusability across different business domains (e.g., a single Date dimension used for sales, finance, and HR reporting). This is a hallmark of dimensional modelling that differs from other approaches like Data Vault.</p></li><li><p><strong>Role-Playing Dimensions:</strong> A "single dimension that can be reused for multiple different things in different contexts." The classic example is the Date dimension serving as Order Date, Delivery Date, Refund Date via different foreign keys in the fact table. Another example is a Location dimension serving as both From Location and To Location.</p></li><li><p><strong>Junk Dimension:</strong> Addresses the "centipede fact table" problem (a wide fact table with many narrow, low-cardinality dimensions). A junk dimension combines "all those little bitty low cardality type things" into "one fat dimension that combines all possible given combinations of them." It's a "miscellaneous dimension" that simplifies queries, though it's "not a particularly common pattern."</p></li><li><p><strong>Degenerate Dimension:</strong> A "dimensional value" or context "only ever relevant in the context of a given fact," so it's "retained it on your facts table instead." This avoids joining to a separate dimension. While it can reduce joins, Johnny "try and avoid where I can" as requirements can change, leading to a need for that attribute in other contexts, making reuse difficult and potentially complicating user queries if there's no semantic layer. 
Modern analytical engines are also "optimizing their engines for that BI workload for that kind of star schema type shape and it it deals with them absolutely fine," so the performance argument for degenerates is often outdated.</p></li></ul><p><strong>7. Anti-Patterns &amp; Justification:</strong></p><ul><li><p><strong>Single "Thing is a Thing" Fact Table:</strong> Having one huge fact table for "every event all at the lowest grain" with highly abstract dimensions (e.g., a "people" dim for customers/employees/suppliers) is an anti-pattern. Dimensional modelling aims to reflect "business language," so dimensions should represent understandable business concepts. Separate dims for Customer, Employee, Supplier are generally preferred.</p></li><li><p><strong>Joining Across Facts Directly:</strong> This is generally discouraged due to "fan trap and chasm trap" issues and differing grains. Instead, "drill across" functionality relies on "conformed dimensionality" and aggregating queries at the dimension level.</p></li><li><p><strong>Factless Facts:</strong> A fact table that "doesn't have any of those things [measures to aggregate]," but "basically just stores the intersection of the various dimensions and ultimately end up counting rows on it to get the facts." It still has a 'fact' (the count of rows). Johnny prefers to just store the keys, not an extra column of '1's, as that's "bloat that you don't need."</p></li></ul><p><strong>8. Advanced Topics &amp; Nuances:</strong></p><ul><li><p><strong>Bridge Tables:</strong> Primarily used to "resolve many to many type relationships" between facts and dimensions. The example given is a joint bank account where one transaction (fact) relates to two customers (dimension), requiring a bridge table to resolve this. They can also help with "recursive hierarchies." Shane clarifies, "Bridge tables are a bridge between the facts and the dims not a bridge across facts."</p></li><li><p><strong>Late Arriving Facts:</strong> When a fact arrives <em>before</em> its corresponding dimension record has been loaded. This is handled by assigning an "unknown member" (a default surrogate key like -1 or 99999) in the fact table, and then "rolling windows" or later updates to assign the correct dimension key once it's available. This ensures "referential integrity" even if it means pipelines run "slower and a bit more expensive."</p></li><li><p><strong>Impact on BI Tools (e.g., PowerBI):</strong> PowerBI's Vertipac engine (compression engine) and DAX language are "structured to work with" star schemas, making it "far more efficient" than a single large table for analytical workloads, especially for compression and querying.</p></li><li><p><strong>Layered Data Architectures:</strong> Dimensional models often sit as a "core reporting layer" on top of other modelling techniques like Data Vault (which is typically not exposed directly to analysts due to its complexity for querying). This provides "flexibility" and "context" while making data "easy for the end user."</p></li><li><p><strong>The Importance of Context and Trade-offs:</strong> The choice of pattern always depends on context. "The nuance is in absolutely understanding what patterns are available and which ones to use when." 
Sometimes, knowingly implementing an "anti-pattern" might be justified for a specific edge case if it "makes sense in the right context and you can justify it."</p></li><li><p><strong>The Consultant's Mindset:</strong> Johnny and Shane discuss the need for a "dimensional checklist" when starting with a new client, to quickly understand their existing patterns and ensure consistency.</p></li></ul><p><strong>Conclusion:</strong> This episode serves as an excellent deep dive into dimensional modelling, re-emphasising its foundational role in data analytics. Johnny Winter expertly navigates complex concepts, providing practical examples and personal insights into common challenges and best practices. The discussion underscores that while technology evolves, the core principles of dimensional modelling remain highly effective for building robust, performant, and user-friendly analytical data platforms.</p><p></p><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><p></p><h2>Transcript</h2><p><strong>Shane</strong>: Welcome to the Agile Data Podcast. I'm Shane Gibson.</p><p><strong>Johnny</strong>: And I'm Johnny Winter.</p><p><strong>Shane</strong>: Hey Johnny. Thanks for coming on today. We are gonna talk about the patterns of dimensionally modeling. It's an interesting one for me because I started my career in data when we did lots of three and F, and then we went on to dimensionally modeling.</p><p>I'm that old. And then I moved on. I used a bunch of other patterns and I was sitting back the other day thinking, holy shit. I've had a bunch of podcasts talk about modeling patterns, but actually nobody on to describe dimensional modeling, which pretty much is still the number one analytical modeling technique in the world.</p><p>Before we rip into that, why don't you give the audience a bit of background about yourself? </p><p><strong>Johnny</strong>: Yeah, absolutely. So. I guess some people listening, actually, even as you did that intro, all that was going on in my head was the intro music. 'cause that's like a ripping tune and if I ever do my own podcast, like the standard's been set in terms of, uh, intro music.</p><p>But yeah, me, I'm Johnny. I, I'm based in Preston, in Lhi in the uk. I have been working in data since I always say 2007, somewhere at my house. I've got a certificate where I went on a Crystal Reports course, so that's where I cut my teeth as a data professional. So basically a report writer, analyst type role using good old business objects, crystal reports, that's where I, my sql then effectively graduated, which think is quite a common path from that kind of sort of analyst type role into more the sort of full BI developer as we called it back in the day.</p><p>Moved on to building, developing data warehouses, the ETL and all the dimensional modeling that came with that. Historically that was good old Microsoft stack, so on-premise, SQL server integration services, those kinds of things. 
As I still did like the reporting services element of it as well. So I did the full sort of end-to-end.</p><p>It was before we got trendy and had data engineers versus analysts as it were. I used to do absolutely everything and then, yeah, graduated more into cloudy stuff from a tech perspective. In my last four or five years, I've been working in data consultancy for a couple of different consultancy firms, and I'm a consultant today looking to start my own independent consultancy imminently.</p><p><strong>Shane</strong>: Wow. Welcome to the chaos that is doing a business, running a business and trying to grow a business. Yeah, looking forward to it. How to take a 40-hour job and make it a hundred and twenty hours a week.</p><p><strong>Johnny</strong>: Yeah. So simultaneously terrified and excited. What is it? Find a job you love and you never have to work another day in your life.</p><p>I'm pretty sure that it works the other way as well. Find a job you love and you'll never have another day off in your life. </p><p><strong>Shane</strong>: Yeah, I think the key I say to people is always make sure you have annual leave. The problem with consulting is there's gonna be lumps and bumps. There's gonna be times where you're not working due to no choice of your own, and because of that you tend not to book holidays and then you burn out.</p><p>So yeah, just get somebody else to help you run the company, your partner or whatever, and say, this is annual leave. You've gotta take it every year, 'cause otherwise you won't and you'll regret it. Crystal Reports though, the real question there to age people is, was that Crystal Reports before Business Objects or after Business Objects?</p><p><strong>Johnny</strong>: So in terms of my exposure to it, I started using it just after it got acquired. So I think it was Seagate before that. So when I started using it, the company I worked for at the time, there was this Seagate footprint everywhere. Like quite a lot of the system accounts that we'd used for accessing databases.</p><p>'Cause we didn't do things properly like you would do today with service principals and whatnot. So a lot of the accounts that we'd accessed things with were like Seagate, everything was labeled Seagate. But yeah, they had just been acquired by Business Objects at that point in time. I think by the time I stopped using it, it was just as they got acquired by SAP as well.</p><p>But yeah, it was just the Crystal Reports side of it. I never really touched like the universe side of it, the semantic layer, which is weird 'cause I'm like a bit of a semantic layer geek now. I guess my sort of public persona in the kind of data community, a lot of it has been built around Power BI, which is a semantic model, like I'm a big fan, and one of the reasons is that I'm a big dimensional modeling fan as well, because it's one of the things that Power BI is absolutely optimized for.</p><p>But yeah, never really got to grips with universes. I was just a pure Crystal Reports guy, and at the time it was Crystal Reports always on our OLTP databases as well. When I learnt my trade, I didn't know what a data warehouse was. I didn't know what a dimensional model was. </p><p><strong>Shane</strong>: I think back in those days we used to do ODSs, which were really just near real-time replicas or overnight replicas of the source system.</p><p>And then you were munging all those horrible tables together and there was no analytical layer as such. 
It was just bunch of bloody horrible queries, which is where business objects in the universe is actually had massive value. And then from memory, it took them ages to get crystal reports to run against the semantic layer.</p><p>It reminds me of SQL Server reporting Services, SSRS, where again, that was direct query against the relational databases. It never really had the semantic layer that you got when Power BI kind of took hold. </p><p><strong>Johnny</strong>: Yeah. Funnily enough, Krista reports to SQL Server reporting services was exactly the sort of step that I took as I left the role I was doing using Krista reports.</p><p>We were actually in the process of migrating to S server reporting services instead. So I had a bit of a foot in both camps to an extent. Definitely from a analysis service perspective, you could connect SSRS reports to analysis services semantic layer really easily. But the crystal reports that I had exposure to is exactly what you described overnight, ODS, just a database replication once a night and then you'd write your reports.</p><p>And I got forever frustrated about the fact I'd have to write the same convoluted business logic over and over again for lots and lots of different, similar concepts, similar reports for stakeholders, oh, we need to do this thing, and I'd do it and be quite pleased with it. And it was like somebody else wants something a little bit different, perhaps a variation on it.</p><p>And I'd have to. Reuse some of that logic all over again and it sent me down a bit of a, a rabbit hole in terms of trying to research that there must be a better way. And that is when I picked up my first copy of the Date Warehouse toolkit and learned about what a date Warehouse was. And ultimately I actually left that role because the project to implement our first data warehouse spent two years getting off the ground and eventually I ran outta patients.</p><p>'cause they'd not even started it by the time I actually ended up resigning because I was like, no, I'm fed up with this. I'm gonna go work somewhere that's actually doing the things I want to </p><p>do. Forget the days where we used to spend six months to a year doing requirements and then six months to a year buying hardware and waiting to rack and stack it and get the database installed before we could even start.</p><p><strong>Shane</strong>: The world has certainly moved in a good way. </p><p><strong>Johnny</strong>: Yeah. </p><p>This is working in defense industry as well. So they were very risk averse and the amount of red tape we had to go through for any kind of IT type projects was just horrific. Anyway, so I get the impression speaking to former colleagues that even with the way the world's gone these days, it's still like that they're still mostly on-prem based and getting anything up and running.</p><p>Still very slow. So I'm </p><p>glad I got out what I did. Let's go into that idea of data warehousing, dimensional modeling, star schemas, all those good words. Just kinda want to go through and just discuss the patterns. 'cause what we're seeing is quite interesting with the adoption of DBT as a tool. Dimensional modeling seems to have come back to the four.</p><p>For those people that are modeling or consciously modeling. Dimensional modeling seems to have come back as again, the number one modeling technique that is used in those kind of modern data stacks. So let's start off at the beginning. Dimensional modeling, uh, has this concept of facts and dims. 
Talk me through those.</p><p><strong>Johnny</strong>: Yeah, that is not far off the standard sort of textbook answer if you were to describe what dimensional modeling is: organizing your data into two categories of table, facts or dimensions. The facts tend to be the measurements, the things you actually potentially want to aggregate or trend. The dimensions are effectively the context that you apply to those.</p><p>So I always talk about the slicing and dicing of the data. I think when you first asked whether or not I'd like to be a podcast guest and talk about dimensional modeling, we talked about the fact that I was going to potentially do a bit of a blog series about dimensional modeling, and that perhaps we could then wrap that up with me being a podcast guest.</p><p>And I think I wrote part one, and part two is still in draft, and I've just not got round to it. And I was like, yeah, okay, Shane, let's just do the podcast because the blogs are going slow. But I've still been putting quite a lot of thought into what the content of those blogs is gonna be. And one of those is absolutely the sort of "oh, dimensional modeling, yeah, it's just facts and dimensions" idea. And then you dive into it and it really isn't; it's actually lots more nuanced than that.</p><p>So I guess the first layer of the onion is very much the facts, the things that you wanna be able to measure; the event, almost, is the way that I describe it, especially in the sort of Lawrence Corr core business event type context. And then the dimensions being, when I think of my seven W's from a Lawrence Corr BEAM perspective, my how-manys being my facts, and then my other W's, my who, what, when, where, why and how, being the dimensions that are gonna sit around it.</p><p><strong>Shane</strong>: Yeah. Think about the dims as the things we want to look at, the things we can see: customer, supplier, employee, store, those kinds of things.</p><p>And then the facts are things we wanna count: order value, payment value, those kinds of things. And what we're doing is effectively breaking the data out into those types of tables. In the early days it was around constraints. Our databases were constrained in such a way, for performance and cost, that we couldn't just load all the data into a data lake and query it willy-nilly.</p><p>We actually had to model it in certain ways for performance reasons as well as reuse reasons, and that's where it came from. And so yeah, we've got this idea of a dim being a thing and the fact being a measure of those things. And then the next thing that often we need to talk about is grain. As soon as we talk about a fact, we will typically wanna have a conversation about grain.</p><p>So do you just want to explain how you think about the grain of the facts?</p><p><strong>Johnny</strong>: In many ways, being brutal about it, I don't tend to overthink the grain of my facts too much. And like historically, as you say, from almost a technology constraints perspective, you potentially end up with aggregated grains and things like that.</p><p>But I've always just gone in trying to model at the atomic grain, the lowest grain, anyway. So grain for me is always about level of detail, and when you build a fact table you've got to figure out what level of detail you're gonna get to. So I'm always aiming to get to the very, very lowest grain of detail I can. You can always roll up grain easily in a query.</p>
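<p>(As a rough illustration of that roll-up, with table and column names invented for this sketch rather than taken from the episode: if the fact is stored at order-line grain, order-level numbers are just an aggregation away.)</p><pre><code>-- Illustrative only: a fact table stored at the lowest (order line) grain
-- fact_sales_line(order_id, order_line_id, customer_key, product_key,
--                 order_date_key, quantity, line_amount)

-- Rolling up to order grain is a simple aggregation over the granular fact
SELECT
    order_id,
    customer_key,
    order_date_key,
    SUM(quantity)    AS total_quantity,
    SUM(line_amount) AS order_amount
FROM fact_sales_line
GROUP BY order_id, customer_key, order_date_key;

-- Going the other way (from order grain back down to order lines) is not
-- possible, which is why the default here is to model at the lowest grain.
</code></pre>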
<p>You can always do an aggregation on top of a very granular fact table, but it's very difficult to do it the other way round. So for me, grain is about trying to get to the lowest detail of information available. To an extent, in terms of a dimensional model, I'm also looking at the cardinality of the fact table to its dimensions: each unit of measurement that's in my fact table should only relate to one value in my dimensions, in an ideal world.</p><p><strong>Shane</strong>: We used to talk about transactional facts and aggregated facts. The aggregated facts back then were a constraint-based pattern. We couldn't hold all the transactions and then query them with an aggregated query fast enough, so we needed to materialize or physicalize the aggregations for performance reasons.</p><p>Whereas now, not so much. We've got the firepower to be lazy and bring it all through, and we get less work up front, more value downstream.</p><p><strong>Johnny</strong>: Yeah, going down a sort of, what I call, techie gubbins perspective, the advent of column stores really helps with that. Column store databases, and now storage formats like Parquet being columnar as well, make aggregating data really easy, and it helps reduce the storage footprint too. So you're getting the best of both worlds with it.</p><p><strong>Shane</strong>: But if you look at a fact, you're still gonna say, is the grain of that fact an order or an order line, aren't you?</p><p><strong>Johnny</strong>: Yeah, absolutely. I'd still always strive to just go as low as I can. So in that scenario I'd always say we should build at order line, 'cause we can always roll up to order. But yeah, absolutely.</p><p><strong>Shane</strong>: So treat your facts as the lowest grain possible for now. And then if you have to aggregate or change the grain for another fact table, you're doing it for specific reasons. It's not an anti-pattern, but you're applying that aggregated pattern for a specific reason: performance, ease of use, something like that, rather than using it as the default. And then the next core term is slowly changing dimensions. Type one, two, what is it, I can't remember, seven? There's all these numbers.</p><p><strong>Johnny</strong>: Seven's as far as I've gotten to in terms of understanding them. In fact, there is no way that I could reel off what all seven are, 'cause they're just not used that much. It's zero-indexed as well; I think sometimes people forget there's a type zero to start with, too.</p><p>So yeah, slowly changing dimensions are effectively about the context of your dimension with regard to your facts, and the phrase I've been using most recently is "as is" and "as was" type reporting.</p><p>The one that I like to talk about is a sickness record, for argument's sake. Say you're somebody in HR and you wanted to understand which job titles, or which job roles, potentially cause the most illness, and we had a fact table that recorded a transaction every time somebody took a day off sick, and then a dimension for the employee.</p><p>And one of the attributes of the employee was their job title, so that you can then slice and dice it and say, ah, we can see that data consultants have had a hundred sick days this year.
Whereas agile coaches have only had five sick days, so maybe we need to give a bit of wellness training to our data consultants.</p><p>What you've got to take into account is the fact that someone's job title can change over time. So I'm a data consultant now, but I'm gonna be a business owner in a couple of months' time. You were a data consultant and now you're an agile data coach, ish. So yeah, type zero is, basically, it never ever changes.</p><p>The best example I always have for that is time: there will always be 24 hours in a day, there will always be 60 minutes in an hour, there'll always be 60 seconds in a minute. So if you've got a time dimension, once you've defined it, you're never gonna have to change it. Type one is that it gets overwritten.</p><p>Me, for example, currently as a data consultant: if my job title changes to business owner, the record just gets overwritten. But what that means is that all of my historical records will now point to that value. So my historical sick record would now have all of my sickness loaded against business owner, so it might cover up the fact that data consultant was the thing that was causing the stress. Type two basically allows you to track changes over time. Every time there's a change to the record, it creates a new record, so that you can then say, ah, all those absences that happened last year, the job title was data consultant; all the absences next year, not that I'm ever gonna have any, are the business owner absences.</p><p>Then you get an essentially more accurate reflection of that particular analysis, but you've always got to bear in mind that isn't necessarily what the users always want. Sometimes the users want to know what the current value is. We were speaking with a client the other day and they were talking about mergers and acquisitions, and they said that if a particular entity changed its name, they would always want the historical records to be recorded against the current name of the entity, because from a historical reporting perspective they'd want it overwritten. So you've gotta understand it from that perspective.</p><p>Type three is history recording, except rather than getting an extra record when the value changes like a type two, you get an additional column, and it basically only ever tracks the previous version. So you don't get the full history, but you get the previous value.</p><p>The good fun I've been getting into recently is type sixes and sevens. Type six is a hybrid of one and three, and type seven is a hybrid of one and two, which is where you get into the realms of being able to do both and starting to use things like durable keys. I'm going properly down the rabbit hole at that point.</p><p><strong>Shane</strong>: And I think it'd be fair to say that most people do ones and twos. Very rarely do you go into the other numbers unless there's a really specific use case, and then it's well documented how to do it. You just need to know they exist; it's very rare that you're gonna bake that in. And again, I think in the early days we used to have to make a call between ones and twos because we had a constraint on how much data we could store.</p><p>So we would cherry-pick which dimensions we moved to a type two, because it involved more data, more complexity. Whereas now I'm guessing everybody is type two by default and then may surface a type one view, something like the sketch below.</p>
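<p>(A minimal sketch of that idea, assuming a valid-from/valid-to style type two dimension; all names here are illustrative rather than from any specific implementation. The history lives in the dimension, and a simple view exposes only the current row for anyone who just wants the as-is picture.)</p><pre><code>-- Illustrative only: a type two employee dimension tracking job title over time
CREATE TABLE dim_employee (
    employee_key   INTEGER      NOT NULL,  -- surrogate key, a new one per version
    employee_id    VARCHAR(20)  NOT NULL,  -- business key, repeats across versions
    employee_name  VARCHAR(100),
    job_title      VARCHAR(100),
    valid_from     DATE         NOT NULL,
    valid_to       DATE         NOT NULL,  -- e.g. 9999-12-31 on the current row
    is_current     CHAR(1)      NOT NULL   -- 'Y' on the open-ended current row
);

-- A "type one" style view over the type two data: current values only
CREATE VIEW dim_employee_current AS
SELECT employee_key, employee_id, employee_name, job_title
FROM dim_employee
WHERE is_current = 'Y';
</code></pre>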
<p>When you query the data, you always just see it as it is now, and if you want to, you can query another view which is as at a certain date, if you choose it. Or is that not true? Are people still defaulting to type one by default and then type two by exception?</p><p><strong>Johnny</strong>: My experience has been more type one by default, type two by exception. Yeah, there are so many nuanced arguments in terms of the way to implement it these days, even getting into the realms of persisted keys and things like that.</p><p>'Cause I'm an old-school surrogate key guy, and I believe that's still really important from a relational integrity perspective. You get quite a lot of people who prefer the sort of hash key. So basically, the importance of a surrogate key is this idea that it's almost like abstracting the business entity away from the business keys; you're basically just having a key that identifies a particular record and is unique to your data platform. When you get into the realms of things like type two dimensions, it's really important to have a surrogate key, because you're gonna get duplicate business keys or natural keys in there: every time the record changes, the business key's not gonna change, but you need a new unique key in your data warehouse.</p><p>Some people are fans of hash keys these days; there's this idea of hashing values to produce a key that's gonna be pretty consistent. There are some drawbacks with that: the hashes aren't always perfect, you can end up with clashes. I was always coached on just the incrementing key type approach, and there are loads of people who argue against that these days. My experience so far is that people have gotten a bit lazy by using the hash key value because they can create them on the fly, and they're like, oh great, we're gonna have perfect relational integrity because when we use this hash it's gonna make all the records match, but they're still not checking that they've got that integrity between their data sources.</p><p>So you can still end up with missing records and mismatches anyway. So I tend to use the old-school surrogate key, 'cause it at least forces you to go back and look at your dimensions and make sure that your values exist and that you've got things like unknown members. I feel like I'm going down another rabbit hole; I've just said lots and lots of phrases and I take for granted that I know what they mean.</p><p><strong>Shane</strong>: But the good thing about dimensional modeling, and one of the reasons I think it is still so popular, is Kimball and Margy Ross wrote a hell of a lot. They wrote lots of books that were easy to read and understand. They wrote lots of blog posts, which became books.</p><p>Some of those old tips and tricks, we should probably archive them so we don't lose them if those sites ever go down; they give us really good examples of, if you have this problem, this is how you deal with it. I'm a big data vault fan and we haven't had that gift in data vault land. If you read any of the data vault books, they're pretty dire.</p><p>They don't explain things well, in my view, and this is my personal opinion even though I use the modeling technique all the time; of all the tips and tricks, there are very few of them that are usable. And lots of people have tried. So yeah, I think part of it is, if you wanna understand what a surrogate key is, get the books or go read the blog posts and they will explain it in infinite detail in a way you can understand.</p><p>I think one of the key things you said was the surrogate key pattern. This pattern of saying, I'm gonna look up the unique business key, and then I'm actually gonna store another unique key, maybe an incrementing integer key that just goes up by one, to say in the data warehouse, the data platform, that is now the identifier for this customer, this employee, this supplier.</p><p>We definitely use different technical patterns now. A concatenated business key, I think, is a viable pattern with the technology we have today. Hash keys, yes, we have collisions, that is a problem we need to worry about. How often do they really happen? Would we even know? We wouldn't know. There are technical implementations, the ability to wrap trust patterns around it, to say, if I've got a key in my Salesforce and I've got a key in my operational system: in my Salesforce, business key one is Johnny; in my operational system, business key one is Shane. I have to actually do some logic, because I can't just slam those keys together and put them in the same place; I have to make sure they're unique, otherwise I'm gonna do a whole lot of bad things.</p><p>Whether we surrogate, whether we hash, whether we concatenate business keys, who cares? Pick one that works for you, for the technology you're using, and then make sure it's accurate, it's trustworthy, it's got all that rigor around it, 'cause those are the patterns that actually really count. So then let's go back to the SCD two. So I've got Shane the consultant, and I've gone on my lovely two weeks' holiday because I'm working for a consulting company and I actually get leave, and then I become Shane the company owner, and I go and take the one day of weekend I get forced to take by somebody else for the year.</p><p>We have a bunch of techniques for how we record that the record has changed: end-dating versus windowing. So do you wanna talk through those?</p><p><strong>Johnny</strong>: Yeah. So in terms of the way I've always done it, and all the implementations I've seen, it's always about having that valid-from and valid-to date, or a start date and end date, which can be supplemented with an is-active flag potentially, so you always know what the current one is. I've seen different people do them different ways. Personally, for me, having a definitive start date and end date works really nicely. One of the patterns that I hate seeing is people who join on the is-active indicator. 'Cause if you happen to be loading a historic record, say a backdated payment that needs to go back to when you were working as a consultant, but we load it based on the active flag, it lands against business owner. For me, that's an inaccuracy. So I would always time-bound it between my active windows, and I always hate leaving an end date as null as well, because then you're into the realms of casting that as some high date just to be able to figure out your date ranges.</p><p>I guess one of the interesting things about the pattern is that even the patterns themselves have patterns within them. There's the pattern at a conceptual level, and then there are other patterns at an implementation level that can differ as well. But yeah, type two.
You've basically got an active-from date and an active-to date: what period of time a particular record was valid for.</p><p><strong>Shane</strong>: And this is the key: there's the logical modeling or the conceptual modeling of dimensional (I've got a bunch of dims, I've got a bunch of facts), and then there's the physical implementation. How am I doing my end-dating strategy? So, as you said, I was gonna ask you: have you got start and end dates, and if it's the current active record, is your end date a null or do you pick a 9999?</p><p>Yeah, that's always interesting. And then we got this idea of windowing, which kind of came out of the Hadoop stage. When we moved from relational databases to Hadoop, one of the problems was that actually changing a record was incredibly expensive. So if we landed a record and then we wanted to end-date it, it wasn't performant.</p><p>So we moved to this insert-only model, this idea that we wanna be insert-only if we can. And that forced us down the idea of using a pattern of windowing. We only ever have the start date, and then whenever we run the query, we go back and say, run a window function and tell me, between this period, what is effectively the active record. And that was an expensive query. So we've just gotta make choices.</p><p><strong>Johnny</strong>: I've seen windowing, and it wasn't even in a Hadoop implementation, it was just in a relational database where somebody hadn't done an end date. Same rationale, but for the technology, probably the wrong choice. And even for that it was an expensive query, which is why I prefer not to do it.</p><p>But again, I've missed that era to an extent. I went from relational databases and then just leapfrogged straight to lakehouses and open table formats. And so with your Deltas and your Icebergs, that kind of updating of historic records, I don't wanna go down the rabbit hole of explaining how it happens, but functionally you can, even though the way it works under the hood is insert-only.</p><p><strong>Shane</strong>: Yeah, it is still an update of records in those technologies. It's still a technical anti-pattern for those technologies; it's just that they've worked out ways to make it work, and you shouldn't care. You may get a query go from two seconds to 2.1 seconds, unless you really care, which you don't at that level. So you've just gotta choose a pattern that works for you. You can do start-only dates and windowing and then create views on top, and the view can give you an end date on the fly.</p><p>There are lots of choices to make it easy for the end user versus performance; you've just gotta pick one. The interesting thing for us is, with our product, we end-date, because it's way cheaper. With the partitioning strategy that we use with BigQuery, it's way cheaper for us to end-date the records than it is to have a windowing function.</p><p>So when we looked at it, and cost was key for us, we said it's a trade-off decision we're gonna make: we're gonna go touch that record, because it saves us money over everything we do if we use that pattern. So let's pick one. And then the other key is, when you walk into a new site, you've now got a whole lot of questions you need to ask yourself.</p><p>Are they dimensionally modeling? Great. Okay. What's the typical grain of the facts? Is it transactional? Yeah. Are there any aggregate facts? No. Okay. What's their dim strategy? It's primarily SCD type one. Okay. How do they make a decision when they need a type two? Okay. When they do a type two, what's their end-dating strategy?</p><p>Are they relying on active flags? Are they start/end dating? How do they fill out the end-date field? Is it a null? Is it a specific date? Are they windowing? Where's the query for the window? Is it left to the user? Is it in Power BI? There are all these patterns within patterns that are known, but as soon as you walk into a place you've never seen before, you have to ask all those questions, 'cause you need to know, you need to follow those patterns.</p><p>Heaven forbid you have one dimension that's end-dated and another that's windowed. You have to say, what the fuck? There may be a reason, but I'm gonna ask really hard questions about why we are using a different end-dating strategy for dims in the same data platform. There's gotta be a reason for that, apart from, oh, a new developer came on, that's why they did it.</p><p><strong>Johnny</strong>: That would absolutely fry my OCD, just in terms of consistency. I can get quite opinionated on what my preferred patterns are, and I don't mind being challenged on those, and even the patterns I prefer have some trade-offs, and if those trade-offs aren't the right trade-offs for a client, then it's okay: great, you're wanting to optimize for something different to the kind of default that I try to optimize for, that's fine. But yeah, consistency's gotta be key. It's great though, that little spiel you just went on in terms of all those things you've gotta think of; in my head I was like, it's just facts and dimensions, dimensional modeling, just facts and dimensions.</p><p><strong>Shane</strong>: Well actually, I'll tell you what, you've just got me thinking again. So you know how we're working on that data layer checklist? I'm wondering if there's a dimensional checklist for when you walk in as a consultant to a site you haven't seen before. You just have a little checklist where you go through and say, here are the patterns; let me just tick some boxes so that I'm thinking about it.</p><p>The checklist is just helping me remember to ask those questions. And I'm like, yeah, I've gotta reset, because I've come off a gig for another customer that's got a slightly different pattern and now I've gotta reset my brain. And as a consultant, that flipping between organizations, that slight variation in pattern, sometimes gives you a problem, because you haven't reset your brain to the greenfield.</p><p><strong>Johnny</strong>: Yeah, I've struggled with it recently with a client. There are a couple of things that they've done. First of all, are you familiar with this: if you were to look up dimensional modeling and fact tables on the internet, the common answer says there are three types of fact table.</p><p>You've got transactional, you've got accumulating snapshot, and then you've got snapshot. Transactional is almost as it sounds; it's pretty much append-only. As a new action happens, you keep adding to it. Accumulating snapshot tends to be more sort of process based. If we took booking this podcast as an example, I think you probably asked me in about March. So you'd create a record that said Shane invited guest for podcast, 1st of March, and then the ball was in my court in terms of committing to a date, and I sat on my hands for a long time, and I think last week was when I was like, let's do it. So that would've been mid-July. So we'd update our record to say date invited, date confirmed, and then the date it happened, for argument's sake 21st of July, as it is now. So that's one record, but we've updated it three times; the accumulating snapshot gets updated based on dates.</p><p>A plain snapshot, you just take a full data dump every single day. Things like stock balances are normally quite good for that, bank balances, maybe, things like that. Finance teams are quite keen on the month-end closing balances and things like that. So you take a snapshot of the data at a point in time. So you've got those three types, but actually there is a fourth type that just doesn't get mentioned very often, and I think it's 'cause Kimball only thought about it as a bit of an afterthought.</p><p>People refer to it as a type 145 fact table. Have you heard of this type 145?</p><p><strong>Shane</strong>: No, I haven't actually, but I'm like, wow, what a great name. You've already blown it, you had to give the thing a stupid name.</p><p><strong>Johnny</strong>: Yeah. To be fair, Kimball didn't call it a type 145; it's just that data modelers have gone with that, because all of the supplementary information on the Kimball Group site, all of the blogs, are numbered. So article number 145, if you were to look it up, basically refers to this fourth type of fact table, which is an accumulating timespan snapshot. So we've basically taken two out of the three types of fact table and mashed them together somehow. The other thing that people call them, which really grinds my gears 'cause it's a complete oxymoron, is an SCD type two fact table.</p><p>And it's, okay, SCD is for dimension tables and this is a fact table, so it can't be a type two. But the reason people get confused is 'cause it works the same way; it's like a process-driven type fact table whereby the status of something changes and you get a new record every single time there's a change in status.</p><p>They're really good for processes, things like a sales funnel or something like that. If you've got a particular opportunity, maybe you get a lead and it's a lead between the 1st of January and the 1st of February, and then on the 1st of February it becomes a qualified lead, and then after two or three weeks of talking it becomes a proposal, and then maybe it becomes a sale. So you've basically got the same record that goes through four different statuses. And then rather than capturing that by making it wide and putting extra columns in to represent each stage of the process, you basically create new records, and you have a start date and an end date associated with each record, like a type two slowly changing dimension.</p><p><strong>Shane</strong>: That's what I call an administration event change: invoice entered, invoice reviewed, invoice approved, invoice paid. 'Cause normally it would be a bunch of columns, and now I've got a whole lot of SQL magic I've gotta do if I want to go and grab those columns and make them rows for whichever BI tool I'm using.</p><p>So what you are saying is effectively we get a new row for that dimension key when there's a state change, and we have a date for that state change. What about event slamming? So let's say I've got an event of something ordered.
I've got an event of a payment, I've got an event of a delivery, and I've got an event of a refund.</p><p>In my understanding, with standard dimensional modeling I'd have four fact tables: I'd have an order table, a payment table, a delivery table, and a refund table. Is that true?</p><p><strong>Johnny</strong>: Oh, it's the classic consultant answer at this point: it depends. My personal preference when I design, and this is not so much a Kimball thing but more of a Lawrence Corr and BEAM type thing, is I always tend to conceptually model them as separate events. And then, having modeled them as separate events, if their conformed dimensionality (there's another great soundbite for you, conformed dimensionality, it's on my list) is identical, or so close to identical that it's not gonna make a difference, and if their grain is also identical, then I may well model them as a single table. But it depends.</p><p><strong>Shane</strong>: Okay. But if I had a fact table that's thing-is-a-thing, one fact table in my whole warehouse, one fact table that's every event, all at the lowest grain, that's an anti-pattern. With dimensional modeling, we don't have fact tables that are thing-is-a-thing. So let's say I've got a fact table with three keys: thing one, thing two, thing three. And thing one goes back to a dimension, and that dimension is a bunch of keys and then types, so that dimension holds employees, suppliers, customers, and there's a typing on it. And then thing two is the event, so it goes back to another dim, and that dim holds all the records and they're typed by order, payment, refund, delivery, packing, all those kinds of things.</p><p>And then we've got thing three, some kind of location dim. So I go back to a dim table and I've got a bunch of keys, and it's every store in the world and then every website and every URL. So I'm modeling at the highest level of abstraction, where I've effectively got three dim tables and one fact table, and everything is thing-is-a-thing. That is an anti-pattern for dimensional modeling. Dimensional modeling is designed to pick up some of the business language. I should be able to look at a dim and actually understand it, understand that it holds a bucket of things that are different from a bucket of other things in my organization.</p><p>Yeah, I agree. So do we still see dimensions of people, which hold customers, suppliers, and employees? Or would you tend to model a customer dim, an employee dim and a supplier dim?</p><p><strong>Johnny</strong>: For me it always goes back, to an extent, to speaking with my business users and how they interpret them. For that kind of example, almost unquestionably they're gonna be separate dimensions: an employee, a supplier and a customer. Almost undoubtedly that's three separate concepts, likely coming from three separate systems, but certainly, even if they're all in one system, separate tables.</p>
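<p>(Sketching the shape being described here, with invented names: separate fact tables per business event, each keyed to the same shared dimensions, rather than one generic thing-is-a-thing fact and dimension.)</p><pre><code>-- Illustrative only: two event fact tables sharing the same dimensions
CREATE TABLE fact_order_line (
    order_date_key INTEGER NOT NULL,      -- shared dim_date
    customer_key   INTEGER NOT NULL,      -- shared dim_customer
    product_key    INTEGER NOT NULL,
    quantity       INTEGER,
    line_amount    DECIMAL(18, 2)
);

CREATE TABLE fact_payment (
    payment_date_key INTEGER NOT NULL,    -- same dim_date reused
    customer_key     INTEGER NOT NULL,    -- same dim_customer reused
    payment_amount   DECIMAL(18, 2)
);

-- Because the dimensions are shared across both events, each fact can be
-- aggregated separately and then compared by customer or date, rather than
-- forcing every event into a single generic fact table.
</code></pre>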
<p>Where it gets fun is when you get into the realms of, so, an example I'm working with at the moment: we've got this idea of a researcher dimension, and a researcher is actually a subset of employees, but it's, okay, should we still have an employee table and a researcher table?</p><p>We typically tackle things like that with role-playing dimensions, having an employee dimension that can be filtered, and maybe abstracted in a view, as different types, things like that.</p><p><strong>Shane</strong>: Yeah, and it's the same with suppliers and customers. Do I have an org dimension that is effectively role-playing customer and supplier? It's choices, and as long as they fit the dimensional modeling pattern: you've got a dim, you've got a fact, you're deciding the grain of your fact, you're deciding what form of slowly changing your dim is, you've decided how you're gonna manage your keys. Okay, now you've just come to some choices about which pattern works best for you.</p><p>Let's go back to a couple of those things you talked about. Let's go through conformed dimensions first and then role-playing dimensions next, because they're key terms that we'll see in dimensional modeling all the time.</p><p><strong>Johnny</strong>: Conformed dimensions. This is the idea that you can reuse the same dimension across multiple facts.</p><p>So if we had a sickness data model, and we had periods of sickness, and we had our employees so we knew which employees had been sick, you could use that employee dimension with your sickness facts as part of a modeled business process, and then you can do analysis on that. If you had a completely different business domain, let's go with sales, and you wanted to be able to analyze sales and see which employees had sold the most of a particular product, you wouldn't remodel your employee dimension for that sales domain.</p><p>You would use that conformed dimension. So it's this idea that this single entity can be used across many business events, effectively. The classic one is date. I think not far off every dimensional model ever designed has some kind of date dimension attached to it, and you're likely just gonna reuse that date dimension, be it in a finance domain if you were doing financial reporting, or if you're doing sales reporting, or marketing, or anything HR related.</p><p>Writing and maintaining a single date table basically means that date dimension is conformed across all of your potential facts.</p><p><strong>Shane</strong>: And that's one of the things that, I don't know if it's unique to dimensional modeling, but it's definitely one of the things that you do in dimensional modeling that you don't tend to do in some of the others.</p><p>If I'm data vault modeling or I'm activity schema modeling, for me, dates are just attributes of a thing. I don't hold a hub and sat for dates in data vault. I don't hold dates as a secondary key in activity schema. So in dimensional modeling you will typically see a date dim which holds those dates, and it's used across every fact table.</p><p>So another kind of unique thing, I dunno if it's unique 'cause I haven't actually seen every model in the world, but definitely a pattern of a dimensional model, is a date dim. And role-playing; talk me through role-playing dimensions.</p><p><strong>Johnny</strong>: Yeah. Role-playing is the idea of a single dimension that can be reused for multiple different things in different contexts.</p><p>Date is actually quite often a role-playing dimension because, taking into account the example we talked about before, you might have an order date, a delivery date, a refund date. Typically you wouldn't create three specific date tables relating to that. You'd have one date table that you can relate to your fact tables via a variety of foreign keys, and you'd reuse that single dimension in different contexts.</p><p>Date's a really typical one, but even things like, trying to think of good ones, I recently did a supply chain type thing where location was really important, but it was always a from location and a to location, shipping from one location to a different location, and actually things could be going in either direction. So rather than having a from-location and a to-location dimension, we just had a location dimension with everything in it, and then we just role-play to it, depending on the context of whether it was the destination or the origin.</p><p><strong>Shane</strong>: And so if I look at the pattern itself, effectively in our fact table we have two columns with keys, but both of those keys come from the same dimension, and it's the typing of the dimension. So another one would be, if we ever wanna see customers and suppliers in the same fact, we might have an organization dim that's typed by customer and supplier, and then we'll see those keys in two different columns in the fact table.</p><p>Possibly. Let's go through the rest of the weird dimensional things. So, junk dimension. Talk me through that one. That's a cool name, right? That's better than fact 145. Junk dimension.</p><p><strong>Johnny</strong>: The junk dimension. At the same time it's a terrible name, 'cause it makes it sound really throwaway and not particularly valuable.</p><p>So going one back from junk dimension: have you ever come across the concept of a centipede fact table, or is that a new one on you? That's a new one on me? Cool. So a centipede fact table is where you end up with a fact table that's got lots of different context around it, and the context is very specific and granular.</p><p>So you end up with a wide fact table that's got lots and lots of dimensional keys in it, and then the dimensions it's joining to end up being very narrow dimensions with only a few attributes on them. And you basically end up, to write any given query, having to do lots and lots of joins all over the shop.</p><p>And they call it a centipede fact table because, if you think about an ERD, the fact table ends up being very long and thin with lots of relationships coming off it like little centipede legs. And they're difficult to use because your dimensions are spread across so many different sorts of categories; they're difficult to navigate. You've gotta write these really unwieldy SQL queries where you've got lots and lots of joins, and they only tend to be joins that are one hop, we're not talking snowflaking, so they're not that bad from a performance perspective, but writing them becomes unwieldy. And you can cure that by, well, there's a couple of things really.</p><p>If you've got any sort of commonality between those lots and lots of small dimensions, you can perhaps denormalize them into a single entity, and that works nicely, but sometimes you just can't. And the way around that, the idea of this junk dimension, is to take all those little itty-bitty, low cardinality type things and actually just create the product of them all in one fat dimension that combines all the possible combinations of them, so that you've got this one entity that you can navigate through.</p><p>It's almost like a miscellaneous dimension; in fact, miscellaneous dimension for me feels like a better description than junk dimension. But it's this kind of, just group all these things together, because lots and lots of little homes for them doesn't make much sense, so we're just gonna stick them in one big group together and allow people to query by that instead.</p><p><strong>Shane</strong>: And so how does the key for that work? Because now we've got a bunch of things that kind of aren't the same. So we're still gonna surrogate it with an incremental key, but the business key's gonna have no real relationship.</p><p><strong>Johnny</strong>: The business key ends up being just a composite key of every column, and that's how I've always done them. It's any given combination of a set of different things. I'm trying to think of a good example of it. I would say in my life I've only actually implemented junk dimensions maybe twice.</p><p>I don't find it a particularly common pattern, but I'd know how to do it if I needed to, and I think I'd recognize the need for it if I saw it as well. And sometimes I struggle to get my head around when I would use it, but I'd see a centipede fact table and be like, there must be a better way to consolidate this and make it easier to navigate.</p><p><strong>Shane</strong>: So it's in your toolkit. It's not an anti-pattern; it's a pattern you use, but you use it very rarely. It's only when you go, ah, actually this is gonna cause me a problem if I do it the normal way, let me use that alternate pattern and bring it in. And then, degenerate dimension?</p><p><strong>Johnny</strong>: Yeah. So again, degenerate dimension is another one that was gonna be in my blog series, in terms of, oh, it's just facts and dimensions, isn't it? So, degenerate dimension: different people seem to have different interpretations of them. My interpretation: effectively it's a dimensional value, it's context that would ordinarily exist out in a dimension, but you just retain it on your fact table instead.</p><p>So you basically get rid of the need to join out to a separate table to use it, the reason being that it's a piece of contextual information that is only ever relevant in the context of a given fact. 'Cause at that point, if that given dimension is never ever gonna have any kind of conformity with other things, then you may as well just eliminate the need for it.</p><p>Again, if I give you that description, theoretically you could end up with very wide degenerate dimensions, which I definitely wouldn't recommend, 'cause then you end up with a wide fact table. So if it's only a small dimension where the context of it only applies to a given fact, then yeah, you use 'em as degenerates.</p><p>Again, this almost feels like a bit of a personal preference type thing, but it's a pattern I really try and avoid where I can,
'cause the typical thing that happens for me is that you go through the requirements gathering and you come across this particular concept and it feels like it's a good fit for a degenerate dimension.</p><p>And then you go to your customer, your client, and you discuss it conceptually with them: does this thing only ever exist in this context? And they say, yep, absolutely. So you do it as a degenerate dimension, and then the next requirement comes along and all of a sudden, oh yeah, we need that piece of information relating to this different fact, at a different grain as well.</p><p>And at that point people start trying to join fact tables together, which is a discouraged practice as well. So even if it's quite a high cardinality, even if it's gonna be almost one-to-one with the fact table, I will try to avoid degenerate dimensions if I can, just to promote that reusability in all the processes at a later date.</p><p><strong>Shane</strong>: Again, it's not an anti-pattern, but it's a pattern that's used rarely; really, it's an exception. You have to justify why you're using that pattern. It's a pattern that I use rarely, it's a pattern that I see everybody else use all the time, and then I bang my head against the wall. One of the problems with degenerate dimensions: think about it as if you're querying the data. You're a user coming in and you don't have a semantic layer, so you're hitting straight against the dimensional model as your semantic layer, and now you've got another rule.</p><p>The first rule is you join your dims to facts: grab your fact table, join it to the dims you want, and effectively a denormalized one big table is what you're gonna get back from those queries. And then you're saying, oh, but actually you probably need to check the fact table to make sure there are no degenerate dimensions, because if there's an attribute you need and it's actually in the fact table, not the dim, now you've gotta go and do something different in your query. So you're changing their query pattern from the standard pattern to a pattern-plus, and now they've got to remember to check whether that attribute's sitting in the fact table or not, versus attributes always sit in a dim.</p><p>And so again, we don't have the ability to template our code as much. I suppose it's only a second pattern, right: run this query when it's got degenerates. But again, why are we justifying slamming the attribute on the fact table when attributes go on a dim? Maybe back in the days when we had constraints, but we don't tend to have those now.</p><p><strong>Johnny</strong>: Yeah, the other sort of fallacy that I hear oft spouted is that in the world of Spark engines and parallel processing and those kinds of things, oh, it's more efficient to keep it on the flat table. And I get the impression that maybe that was true four or five years ago, but all of the major players these days are optimizing their engines for that BI workload, for that kind of star schema type shape, and it deals with them absolutely fine.</p><p>I've had it a couple of times recently: it's, oh yeah, we need to reduce the number of joins, and it's like, why? The engine can deal with it now. It's not as big a problem as people think it is.</p>
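<p>(A tiny sketch of the shape being debated, with names invented for the example: the order number is context, but it has no attributes of its own, so it sits directly on the fact table as a degenerate dimension instead of joining out to a dimension table of its own.)</p><pre><code>-- Illustrative only: a degenerate dimension kept on the fact table
CREATE TABLE fact_sales_line (
    order_date_key INTEGER      NOT NULL,  -- FK to dim_date
    customer_key   INTEGER      NOT NULL,  -- FK to dim_customer
    product_key    INTEGER      NOT NULL,  -- FK to dim_product
    order_number   VARCHAR(20)  NOT NULL,  -- degenerate dimension: no dim_order table
    quantity       INTEGER,
    line_amount    DECIMAL(18, 2)
);

-- Analysts filter or group by it straight off the fact, with no extra join,
-- which is exactly the extra "check the fact table" rule described above
SELECT order_number, SUM(line_amount) AS order_total
FROM fact_sales_line
GROUP BY order_number;
</code></pre>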
<p><strong>Shane</strong>: I think, again, there are lots of those patterns where we have a preference and we try and justify it for the technology that we built that preference on.</p><p>But things have changed. So test it. Just run the experiment: simulate the volume of data you think you're gonna have and the types of query you're gonna run, then run them and see which performs better. Okay, so you talked about joining across facts. So again, there's this idea of, was it a bridge table, that allows you to slam facts together or join them? I never quite remember that one. Talk me through it.</p><p><strong>Johnny</strong>: Yeah. I've never used bridge tables to span facts, 'cause ultimately that's just a conformed dimension. Effectively, Kimball will talk about this idea of drill-across. And the way to drill across is, again, it's really strange: I find myself in this position where I've lived and breathed this stuff for so long, I find it sometimes difficult to put into words exactly what's meant by some of it. Kimball puts it into a really great description in terms of this idea of drilling across fact tables. Have you come across things like the fan trap and the chasm trap, those kinds of things?</p><p>Fact tables tend to be at different grains, and if you join them together, the cardinality is not gonna match. But if you basically route your queries via your dimensions and then aggregate across facts, it works absolutely fine. It's always gonna depend on how the fact tables have been structured, to an extent, but as a rule, if you structure your queries that way, that's how they're designed to work and how they're designed to drill across, and that's what your conformed dimensionality does.</p><p>Bridge tables are an interesting one. Bridge tables are more, again, one of the patterns I talked about when we talked about defining grain, this idea that every unit of measurement in a fact table, every transaction, should have a one-to-one relationship with its dimensions. It's not always the case. The classic example is bank accounts. I have a joint bank account with my wife. If a bill goes out on a direct debit for a mortgage, for example, there are two customers associated with it. It's one bank account, but it's two customers. So you've almost got to have a bridging table that can resolve that, so that your fact table only has one transaction and it can join through a bridge table and resolve out to those two customers.</p><p>So that's where bridge tables mainly get used. So I went down a real rabbit hole, and again, I take pride in the amount of experience and knowledge I've amassed around dimensional modeling, but I'm definitely not encyclopedic. So it's still, for me, hang on, I've got a problem, I dunno how to solve this, but I have got a shelf of books and in those books are patterns. The one I was trying to solve recently was around hierarchies and the best way to resolve recursive hierarchies. You can just flatten 'em out, which kind of works; if they're not fixed depth that can get a bit messy, but it does work. But there are also patterns you can use with bridge tables that will basically help you resolve recursive hierarchies as well.
So that's the idea: we're talking about resolving many-to-many type relationships, and you basically end up with every level of the hierarchy becoming its own record, and then the bridge table resolves it back to the fact.</p><p><strong>Shane</strong>: Okay, so bridge tables are a bridge between the facts and the dims, not a bridge across facts, is what you just said. That's how I've always used them.</p><p><strong>Johnny</strong>: I feel like I'm gonna have to get my head in my books and see if I can find any examples of bridge tables between facts.</p><p><strong>Shane</strong>: I can't even remember what they were. I just think about bridge tables from data vault, so I'm applying a different pattern to that name in my head.</p><p><strong>Johnny</strong>: I have to say, it's so bizarre. My first foray into data platforms after having been a report analyst was Kimball. And every subsequent organization I've worked in, and I think you alluded to the fact that it's probably the number one applied pattern in the data industry, it's all I've ever known. I've got real blind spots for data vault. I can talk about hubs and satellites and vaguely sound like I know a little bit, but I would not have the foggiest how to start if I had to do a data vault implementation.</p><p><strong>Shane</strong>: And it's a different language and it's the same patterns, to a degree. With data vault, effectively we just take the dimensional key and we make it a hub, so it holds the key only, and then we take the attributes out of a dim and we make those sats, but we can add more than one sat table for a hub. So again, it's a very similar pattern, but it's different, the language is different, and the patterns are slightly different.</p><p>And so you've gotta reset your brain, and then you go to activity schema, and again, it's very similar but very different. And again, like you said, there's the depth and the breadth. You can do dimensional modeling with some of the core patterns and you'll do really well, and then eventually you'll hit an edge case where you need some of the more esoteric ones, and you need to know they exist.</p><p>But as you said, there are some good books, there are all the Kimball books, there are all the blog posts. And then with Lawrence Corr's stuff, half his book is BEAM, which is the stuff I use around the who-does-whats and effectively understanding requirements and concepts and conceptual modeling, and then the other half of his book is how to apply all that to dimensional modeling.</p><p>And I remember quite a large part of the course is around ragged hierarchies and how you model those in dims, and I'm like, I don't care, I'll just use an OLAP cube; back then, it takes care of it for me. So yeah. Alright, let's get onto what I think is probably the last one I can remember, which is factless facts, which kind of sounds dumb. It's kinda like, factless facts? Yeah, totally.</p><p><strong>Johnny</strong>: I'm trying to remember who I was talking to about this the other day, and we just basically decided that it's a nonsense name, this factless fact table thing. The fact table still has facts in it; it's just that there are no values that you're gonna aggregate. So the fact basically represents the intersection of all the dimensions, and ultimately you end up pretty much counting the rows; that becomes the measure at a given intersection.</p>
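<p>(A bare-bones sketch of that, with invented names: the factless fact table is nothing but dimension keys, and the measure is simply a count of the rows at whatever intersection you care about.)</p><pre><code>-- Illustrative only: a factless fact table, just the intersection of dimension keys
CREATE TABLE fact_course_attendance (
    date_key    INTEGER NOT NULL,  -- FK to dim_date
    student_key INTEGER NOT NULL,  -- FK to dim_student
    course_key  INTEGER NOT NULL   -- FK to dim_course
);

-- There is nothing to SUM or AVG; the measure is just counting rows
SELECT course_key, COUNT(*) AS attendance_count
FROM fact_course_attendance
GROUP BY course_key;
</code></pre>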
Quite often when I do basic star scheme examples, I'll fall back on a sales example. It's in online retail for a long time.</p><p>And so a typical sales fax table might have sales amount, it might have cost, it might have margin, it might have order quantity. These are all things that you're gonna be able to add up an average, things like that. A factless fact table doesn't have any of those things. It basically just stores the intersection of the various dimensions and ultimately end up counting rows on it to get the facts.</p><p><strong>Shane</strong>: With a factless fact table, would it just hold the keys or would you have a column with one in every row? </p><p><strong>Johnny</strong>: I would just have the keys again. So I've seen that Pattern as well when people just put a one in. So from a power BI perspective, that's considered an anti-pattern that's just bloats that you don't need.</p><p>It's just a count of the rows. It's you don't need the one 'cause you just do account. Account star or </p><p><strong>Shane</strong>: account one. So let's go into that for a second. So everybody tells me, not everybody, but most people tell me that Power BI is far more efficient when you use the star schema versus one big table or anything else, but they never gimme the context.</p><p>And I'm like, is that when you're not using direct query and you're actually bringing the data back into the Power BI layer? Or is that when you are using DAX or when you're creating a semantic model? Where is the thing that says Power BI works best with a star schema? </p><p><strong>Johnny</strong>: So I feel like I'm gonna have to shout out a couple of Microsoft Oh Form MVPs now I think.</p><p>So there's a Chap Kon kbi. Dutch guy and his catchphrase is that you must star schemer all the things. And he's like a massive advocate of it. And he made stickers and t-shirts and all these things. Star schemer, all the things. And then there's a chap bent, I'm gonna butcher his name, which he'll kill me for if he listens back to this.</p><p>So he is Belgian Benny Dre, he is on the power bi cat team and he did a conference session that was taking the mick out of K bit. Basically the session was called Star Schemer, all the things. But why? And it was really fascinating 'cause it actually dug into it and did a load of tests. So some people were like, oh you need to use a star schemer 'cause you'll get better compression out of the, so Power BI is built on the Verta pack engine, but it's basically the compression engine for it.</p><p>Oh. You get better compression so you'll have a lower memory footprint and you a load of tests versus one big table where you basically proved, yeah that's not really the case. And it was like, oh start scheme will load quicker. 'cause if you've not got less data redundancy and you did a load of tests and it, that wasn't really the case.</p><p>The main thing is that DS is a language. Is structured to work with it. Well, it always strikes me as chicken and egg like in terms of is Dax built to work with star schema? Almost That kind of, what was the way around? I was thinking about it the other day, that effectively, did they make power BI to be optimized for star schema as opposed to people saying that our star schema is what's optimized for Power bi.</p><p>The other soundbite I came up with the other day is that Dimensional modeling's, probably the second best Pattern for everything, speed-wise. 
<p>One big table: a former colleague did his thesis on it, star schema versus one big table, and he was like, oh yeah, one big table is loads better. I was like, why is it better?</p><p>It's quicker. Okay, what about from a maintenance and reusability point of view? And then when you get into the realms of that, actually, yeah, one big table's good and quick and easy to query, but I've got to have a different one big table for lots and lots of different things. And then if I need to update a particular attribute, I'm gonna have to update it in lots of different places.</p><p>And then again, not being an expert on it, but my impression of Data Vault is that, flexibility wise, it's really good. It deals with lots of change really easily. People say it's complex to implement; I've never done one. And potentially querying it, because there's a lot of joining, your tables can be quite complex and difficult to navigate from an analyst perspective.</p><p><strong>Shane</strong>: Yeah, so let's talk about that one, 'cause that's a really interesting one. And it's true. You typically use Data Vault in a layered data architecture, what Joe Reis calls mixed model arts. So we would typically never expose the Data Vault structures as the core reporting layer. We would dimension it, we would one-big-table it, we would activity-schema it.</p><p>We'd do a whole bunch of things to make it easy, because joining lots of tables together as an analyst is an anti-pattern, in my view. You're getting them to do work that the machine can do for you. And then everybody goes, oh yeah, but everybody can understand how to query a star schema. And I'm like, yeah, they can if you train them.</p><p>And they can still get it wrong. If I give them one big table, if I give 'em the table with a grain and all the columns in it, as long as I don't give them 2,000 columns, it's much easier for them to query. Now, yes, what happens is if my only Pattern is one big table, if I write 5,000 dbt models that do all the transformations in code with no segmentation, no layering, no shared reuse, no shared context, and I'm creating nothing but 10,000 one-big-tables, that's a bad Pattern.</p><p>But if my code is effectively my model, the context holds the model, and I'm just hydrating the one big tables at the end, and every time I make a change those tables are automatically refreshed with those changes, not by a human, then that's a Pattern that works. If I'm data vaulting and then dimensioning and then one-big-tabling, it's a Pattern. I can automate that, and I can hold context, and I can get the machine to do all the changes for me when I change that context. So that's my view.</p><p>And I was always intrigued that with Power BI, dimensional is the norm. And that's fine, because with the norm there's lots of good articles, there's lots of things that have been written, there's lots of people that can help you if you get stuck. And if you don't follow the norm, it's a little bit trickier. But why is it the norm? And the same with dbt. For people that now consciously model rather than unconsciously model, dimensional seems to be the flavor they use the most. Now why is that?</p><p>Is it because the information is easily accessible? Is it because that's the thing that people are being trained on the most? Back in our day, dimensional became so popular because Ralph Kimball did a lot of training. You could always go on a dimensional training course. 
It was easy to get hold of somebody that would teach you that.</p><p>Not so much with the other courses. But he doesn't do the training anymore; he's been retired a long time. And then Margy Ross took over, but she's retired now. So actually, who's doing the training? 'Cause it's not the people who invented the patterns. But maybe that's it. Maybe the training is still more accessible, or the books are accessible.</p><p>It's intriguing how that is still the modeling Pattern. And it has value, it's a really valuable Pattern. It's not my favorite Pattern, I'll be honest about that, but that's just my opinion. Like you said, you don't use a lot of junk dimensions or degenerate dimensions. That's just your choice, that's how you model.</p><p>Yeah, and that's fine, you're making conscious decisions around that. Which reminds me, there's one I missed: late arriving facts. Sorry, that was the other one that we probably need to talk about. </p><p><strong>Johnny</strong>: The way I always interpreted them was this idea, and I guess this is a Pattern that we've not really discussed, that you'd always load your dimensions first and then load your facts afterwards.</p><p>And part of that is to guarantee you've got that sort of relational integrity. And again, this goes back to that Indem key argument and the fact that I prefer to look up my keys after the event. So if you're gonna look up your keys when you create your fact tables, that means you've always got to load your dimensions first.</p><p>But what if, in between you loading your dimensions and you loading your facts, a new dimension member occurs? So we sell a brand new product. We load our product dimension and that's got every product that exists, and then, whilst that is happening, a new product goes live and gets sold straight away. So then, when we load the facts, that sale has arrived late, because it's arrived after the dimensions have been processed, and it doesn't have a matching record back in its dimension table to be able to join to.</p><p>You deal with that with an unknown member. So basically a default key that gets assigned where the dimensional record doesn't exist. And then I'd always get into the realms of rolling windows for my updates, 'cause then I'd go back and revisit my fact table and update the key. Which, for me, is contrary to this idea of a transactional fact table that should be write-only, 'cause that's not true: you'd still go back for a late arriving fact and update its dimensional key. </p><p><strong>Shane</strong>: So effectively a fact turns up and there's no dim for the fact, and we'd normally do a placeholder dim, don't we? So there's a dimensional key: 99999, or zero, or minus one.</p><p>Pretty much minus one, that's right. There used to be a big argument, wasn't there, about what you use as the surrogate key for your late arriving fact's dim; people used to argue about that all the time. So effectively the fact turns up, the dimension isn't there for whatever reason, you bind it to this dummy dimension key, and then you go and update the fact later when that dim actually arrives. And that way you get consistency, or referential integrity, across the dims and the facts.</p>
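<p><em>A minimal sketch of the unknown member pattern just described, assuming a hypothetical product dimension lookup: the late arriving fact gets the placeholder key (minus one) at load time, and a later pass re-keys it once the dimension row exists. The names and keys are illustrative only.</em></p><pre><code># Placeholder surrogate key for a dimension row that has not arrived yet.
UNKNOWN_MEMBER_KEY = -1

# Dimension lookup: natural (business) key mapped to surrogate key.
dim_product_lookup = {"SKU-001": 1, "SKU-002": 2}

def assign_product_key(fact_row):
    """Look up the surrogate key; fall back to the unknown member
    if the dimension row is missing (a late arriving fact)."""
    fact_row["product_key"] = dim_product_lookup.get(
        fact_row["product_sku"], UNKNOWN_MEMBER_KEY
    )
    return fact_row

fact_sales = [assign_product_key({"product_sku": "SKU-003", "sales_amount": 9.99})]
print(fact_sales)  # product_key is -1 until SKU-003 exists in the dimension

# Later load: the dimension row turns up, so revisit recent facts
# (a rolling window) and re-key anything still pointing at the unknown member.
dim_product_lookup["SKU-003"] = 3
for row in fact_sales:
    if row["product_key"] == UNKNOWN_MEMBER_KEY:
        row["product_key"] = dim_product_lookup.get(row["product_sku"], UNKNOWN_MEMBER_KEY)
print(fact_sales)
</code></pre>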
<p><strong>Johnny</strong>: It's strange to an extent, 'cause my love of dimensional modeling definitely predates even the invention of Power BI.</p><p>But having turned into a bit of a Power BI nut, I think that's helped me go deeper and further in understanding all this stuff. And that goes back to, people would argue that the idea of the Indem key was that, if you do that in your facts, you don't have to go back and update it afterwards.</p><p>If it's a late arriving fact, it doesn't matter, because the key's already been predetermined, so you don't have to worry about it. But if you can imagine you were doing nightly batch loads, that's a whole day where you've got a relational integrity issue. And from a Power BI perspective, that can have quite a big impact on your DAX when you've not got proper relational integrity.</p><p>So that's one of the reasons I always prefer to do the lookup, 'cause then with the lookup you get to fall back on your unknown member if you need to. </p><p><strong>Shane</strong>: That's the key, isn't it? These weird patterns are there for a reason, because when people have been using this in anger for 20, 30 years, they have found these edge cases that they needed a Pattern to deal with, because they would turn up every now and again, and late arriving facts is one of those.</p><p><strong>Johnny</strong>: This is one of the sort of debates we ended up getting into with the engineering teams I've worked with. It goes back to something Joe Reis talks about again: trade-offs, and understanding what it is you're trying to optimize for. Because your pipelines would be more efficient and run quicker if you don't have to do that dimensional key lookup.</p><p>So your pipelines run quicker. If they run quicker, they're gonna be cheaper, and your data's gonna be more available. Okay, but if my data's available with relational integrity issues, then it's not accurate data. And I'd rather do it slower and a bit more expensive, but have it accurate, than have the most cost efficient pipeline just because you don't think looking up my dimensions is the efficient thing to do. </p><p><strong>Shane</strong>: But also, again, a lot of the patterns were around technical constraints. 'Cause I'm sure I remember, back in the early days of Oracle, when we had foreign keys on the tables between the facts and the dims, the updates were really slow.</p><p>'Cause we had on-prem servers and we were constrained around memory and disk and all that kind of stuff. And I'm pretty sure we used to drop the referential integrity, drop the foreign keys, and then load all the data, making sure that we kept referential integrity via the code, in theory, and then we'd reapply the foreign keys at the end of that process and hope like hell they bloody rebuilt, because that just helped us get the load times down from 12 hours to two hours. Now you would never do that, not that I know of. I mean, half the cloud analytical databases don't have foreign keys for that very reason. But I'd be surprised if anybody's doing that Pattern now: altering the tables, removing the referential integrity, doing a load and then putting it back on. </p><p><strong>Johnny</strong>: Yeah, so what's interesting, and again it's another debate that happens, but even in my on-prem days with SQL Server in a data warehousing context, we never actually applied foreign key constraints. Never.</p><p>We always just loaded it. We dealt with them with logic. 
So whenever we insert this record, we're gonna check it's got a key that matches, and if it doesn't, we're gonna put the unknown member in there. So battling against the constraint checks wasn't a problem. I guess when I talk about referential integrity, that's probably me using my Power BI conditioned brain,</p><p>'cause I'm not talking about the actual database constraint, I'm just talking about the effect that the database constraint would have, if that makes sense. </p><p><strong>Shane</strong>: Yeah. Referential integrity is actually a Pattern that says everything has integrity. But in my head I just naturally go back to databases applying it, and therefore whenever I use the term referential integrity, I'm falling back to that Pattern of it being a technical implementation, not a logical one.</p><p>And I think that's interesting. Somebody said to me the other day that I often bring up ghosts of data past, and it's intriguing, because now I start to think about patterns and I go, where did that Pattern come from? Was it a ghost of data past for a technical constraint? Was it a process constraint? Was it a people constraint?</p><p>Was it an edge case that the patterns don't deal with, and therefore it's still valid? Where did it come from? It's intriguing. And I wonder how much of the star schema stuff comes out of tools like Power BI, like you said, chicken and egg. I think it's just important </p><p><strong>Johnny</strong>: to question it and have that curiosity and go with it from there.</p><p>What was the other thing somebody was talking to me about the other day? I had a really good deep data conversation with a former colleague, and we had a right good natter, putting the world to rights. His problem is that he sees people apply patterns with no context. Because they've seen a Pattern applied before, they assume that's the right Pattern every single time.</p><p>And the nuance is in understanding what patterns are available and which ones to use when, and sometimes when to break the rules. Sometimes when to knowingly implement an anti-pattern almost on purpose, 'cause actually it serves a particular edge case and it makes sense in the right context and you can justify it.</p><p><strong>Shane</strong>: Yeah, which makes it a Pattern then, which is really weird, because actually, yeah, it's a solution to a problem that's not commonly applied, but with the context it actually works. All right. And on that one, just to close it out, if people wanna get hold of you, what's the best way for them to find you, read what you're doing, listen to what you do?</p><p><strong>Johnny</strong>: The worst thing you can possibly do is Google me, 'cause if you Google me, the first hit's gonna be an albino blues guitarist who's now dead and once performed at Woodstock. So Googling Johnny Winter doesn't work. I can't believe I've gotten this far in the podcast and not mentioned it, but I've got this kind of data persona called Greyskull Analytics.</p><p>It's a He-Man reference, Castle Grayskull. It came out of the idea that He-Man's slogan was "I have the power", and I had the Power BI. So that's where Greyskull Analytics came from. And then I've turned it into sort of a massive Skeletor reference as well. So yeah, if people look me up on LinkedIn, you'll see quite a few Skeletor-themed memes being shared there, with various sort of data contacts in them as well.</p><p>Greyskullanalytics.com is my website. 
The version that you can currently see is the original version, 'cause I've reverted it back; it was just really a blog. I've got a YouTube channel as well, so you can get me on YouTube. That greyskullanalytics.com website is in the process of being revamped,</p><p>'cause I'm looking to launch my business in September. But yeah, LinkedIn or greyskullanalytics.com are the best places. Greyskull with an E as opposed to an A, 'cause I anglicized it, partly 'cause I'm English and partly 'cause I didn't want to get sued by Mattel. </p><p><strong>Shane</strong>: Maybe bring the Greyskull Analytics story in right at the beginning next time. It's how I know of you, and I've gotta do a shout out for probably what I reckon must be one of the best LLM engineering prompts in the world, that you can constantly generate Greyskull images that actually look like you've got a Greyskull character sitting in your office somewhere and you're just moving them around.</p><p>The quality of those generations is pretty damn awesome. </p><p><strong>Johnny</strong>: It does well. You're not the first person to point it out either. I tried to do one of Skeletor playing cricket the other day and it couldn't get its head around that. But yeah, most of the time it manages to make a decent fist of it.</p><p><strong>Shane</strong>: Alright, it's been great. Thank you for going through all those dimensional patterns. I've ticked off another set of the modeling patterns for the podcast, and I can't believe it's taken me so long to get round to this one. It probably should have been the first one. But anyway, thank you for that, and I hope everybody has a simply magical day.</p><h2>&#171;oo&#187;</h2><div class="pullquote"><p><em>Stakeholder - &#8220;That&#8217;s not what I wanted!&#8221; <br>Data Team - &#8220;But that&#8217;s what you asked for!&#8221;</em></p></div><p>Struggling to gather data requirements and constantly hearing the conversation above?</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!0Bu2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" width="387" height="342" alt="" loading="lazy"></figure></div><p>Want to learn how to capture data and information requirements in a repeatable way, so stakeholders love them and data teams can build from them, using the Information Product Canvas?</p><p>Have I got the book for you!</p><p>Start your journey to a new Agile Data Way of Working.</p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://adiwow.com/168&quot;,&quot;text&quot;:&quot;Buy the Agile Data Guide now!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://adiwow.com/168"><span>Buy the Agile Data Guide now!</span></a></p><h2>&#171;oo&#187;</h2>]]></content:encoded></item><item><title><![CDATA[Data Engine Thinking Patterns with Roelant Vos]]></title><description><![CDATA[AgileData Podcast #66]]></description><link>https://agiledata.info/p/data-engine-thinking-patterns-with</link><guid isPermaLink="false">https://agiledata.info/p/data-engine-thinking-patterns-with</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Thu, 19 Jun 2025 03:37:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/23566fcf-7929-4f5b-8392-360b0057f9f4_798x798.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Roelant Vos about a number of Data Engine Thinking patterns and his new book Data Engine Thinking.</p><blockquote><p><strong><a href="https://agiledata.substack.com/i/166295398/listen">Listen</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/166295398/google-notebooklm-mindmap">View MindMap</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/166295398/google-notebooklm-briefing">Read AI Summary</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/166295398/transcript">Read Transcript</a></strong></p></blockquote><p></p><h2>Listen</h2><p>Listen on all good podcast hosts or over at:</p><p><a href="https://podcast.agiledata.io/e/data-engine-thinking-patterns-with-roelant-vos-episode-66/">https://podcast.agiledata.io/e/data-engine-thinking-patterns-with-roelant-vos-episode-66/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://podcast.agiledata.io/e/data-engine-thinking-patterns-with-roelant-vos-episode-66/&quot;,&quot;text&quot;:&quot;Listen to AgileData Podcast Episode&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://podcast.agiledata.io/e/data-engine-thinking-patterns-with-roelant-vos-episode-66/"><span>Listen to AgileData Podcast Episode</span></a></p><blockquote><p><strong>Subscribe:</strong> <a href="https://podcasts.apple.com/nz/podcast/agiledata/id1456820781">Apple Podcast</a> | <a href="https://open.spotify.com/show/4wiQWj055HchKMxmYSKRIj">Spotify</a> | <a href="https://www.google.com/podcasts?feed=aHR0cHM6Ly9wb2RjYXN0LmFnaWxlZGF0YS5pby9mZWVkLnhtbA%3D%3D">Google Podcast </a>| <a href="https://music.amazon.com/podcasts/add0fc3f-ee5c-4227-bd28-35144d1bd9a6">Amazon Audible</a> | <a href="https://tunein.com/podcasts/Technology-Podcasts/AgileBI-p1214546/">TuneIn</a> | <a href="https://iheart.com/podcast/96630976">iHeartRadio</a> | <a href="https://player.fm/series/3347067">PlayerFM</a> | <a href="https://www.listennotes.com/podcasts/agiledata-agiledata-8ADKjli_fGx/">Listen Notes</a> | <a href="https://www.podchaser.com/podcasts/agiledata-822089">Podchaser</a> | <a href="https://www.deezer.com/en/show/5294327">Deezer</a> | <a href="https://podcastaddict.com/podcast/agiledata/4554760">Podcast Addict</a> |</p></blockquote><div id="youtube2-rplLKWVr3yo" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;rplLKWVr3yo&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div 
class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/rplLKWVr3yo?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p></p><h2>Buy the Data Engine Thinking Book</h2><p><a href="https://dataenginethinking.com/en/">https://dataenginethinking.com/en/</a></p><p></p><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><h2>Google NotebookLM Mindmap </h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lkbg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa12333c7-5bf1-4087-8af3-ca4f472ac50b_7160x17336.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lkbg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa12333c7-5bf1-4087-8af3-ca4f472ac50b_7160x17336.png 424w, https://substackcdn.com/image/fetch/$s_!lkbg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa12333c7-5bf1-4087-8af3-ca4f472ac50b_7160x17336.png 848w, https://substackcdn.com/image/fetch/$s_!lkbg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa12333c7-5bf1-4087-8af3-ca4f472ac50b_7160x17336.png 1272w, https://substackcdn.com/image/fetch/$s_!lkbg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa12333c7-5bf1-4087-8af3-ca4f472ac50b_7160x17336.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lkbg!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa12333c7-5bf1-4087-8af3-ca4f472ac50b_7160x17336.png" width="1200" height="2905.21978021978" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a12333c7-5bf1-4087-8af3-ca4f472ac50b_7160x17336.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:3525,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:6454921,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/166295398?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa12333c7-5bf1-4087-8af3-ca4f472ac50b_7160x17336.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" 
alt="" srcset="https://substackcdn.com/image/fetch/$s_!lkbg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa12333c7-5bf1-4087-8af3-ca4f472ac50b_7160x17336.png 424w, https://substackcdn.com/image/fetch/$s_!lkbg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa12333c7-5bf1-4087-8af3-ca4f472ac50b_7160x17336.png 848w, https://substackcdn.com/image/fetch/$s_!lkbg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa12333c7-5bf1-4087-8af3-ca4f472ac50b_7160x17336.png 1272w, https://substackcdn.com/image/fetch/$s_!lkbg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa12333c7-5bf1-4087-8af3-ca4f472ac50b_7160x17336.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>Google NoteBookLM Briefing</h2><h3>Detailed Briefing Document: "Data Engine Thinking: Automating the Data Solution"</h3><p><strong>Source:</strong> Excerpts from "Data Engine Thinking: Automating the Data Solution" &#8211; A podcast interview with Roelant Vos (co-author) by Shane Gibson.</p><p><strong>Date:</strong> June 19th, 2025</p><p><strong>Key Speakers:</strong></p><ul><li><p><strong>Roelant Vos:</strong> Co-author of "Data Engine Thinking," experienced data professional (25+ years), with a strong focus on automation and technical data management.</p></li><li><p><strong>Shane Gibson:</strong> Host of the Agile Data Podcast.</p></li></ul><h3>1. <strong>Core Concept: "Data Engine Thinking" and the "Information Factory"</strong></h3><p>The central theme of the book, "Data Engine Thinking," and the podcast discussion, is an end-to-end approach to building data solutions that are inherently "designed for change." This is contrasted with traditional, often manual, data practices.</p><ul><li><p><strong>Designed for Change:</strong> The fundamental goal is to create data solutions that can easily adapt to evolving business needs, data models, and technological landscapes. 
This is achieved primarily through automation and pattern-based design.</p></li><li><p><strong>Information Factory vs. Information Value Stream:</strong> Shane Gibson introduces two concepts:</p></li><li><p><strong>Information Value Stream:</strong> Focuses on the "product thinking" side &#8211; identifying problems, ideating solutions, prioritising work, and delivering value to stakeholders.</p></li><li><p><strong>Information Factory:</strong> Focuses on the "platforms that support that work, and the way we move data through it all the way from collection through to consumption." Roelant confirms that "Data Engine Thinking" is "definitely more in the information factory area." This highlights the book's focus on the underlying architecture and automation capabilities rather than business process design.</p></li></ul><h3>2. <strong>The Imperative of Automation and Lowering the Cost of Change</strong></h3><p>A recurring and foundational idea is that automation is crucial for agility and innovation in data.</p><ul><li><p><strong>Historical Context:</strong> Roelant's journey over 25 years has consistently focused on automation, starting as early as 2000 with tools like Oracle Warehouse Builder and Tickle. He notes, "Absolutely. Yep. So I started that at 2000 actually... Working with Oracle Warehouse Builder made rest in peace. We used this stickle language to try to automate things."</p></li><li><p><strong>Enabling Experimentation and Risk-Taking:</strong> Shane eloquently summarises the core benefit: "if the cost of change is lower, then we are more willing to change... We can manage more, change more often. We can iterate more often. We can take more risks. We can make earlier guesses because we know the cost of refining that guess in the future is lower than if everything was manual."</p></li><li><p><strong>Addressing Flawed Models:</strong> Data models are "always going to be flawed" because initial interpretations of reality are incomplete. As understanding grows, models need refinement. Automation makes this refinement cost-effective: "The more you learn, the more you want to refine that model. And then you have to go back every time to update your code base. And that's why automation is such a critical thing."</p></li><li><p><strong>Overcoming Complexity:</strong> Data solutions are described as "not complicated, but they're complex," due to "a lot of tiny moving bits, pieces." Automation is the key to managing this inherent complexity.</p></li></ul><h3>3. <strong>Pattern-Based Design: Design Patterns vs. Solution Patterns</strong></h3><p>A core distinction made in the book is between "design patterns" and "solution patterns," which provides a structured approach to building robust data systems.</p><ul><li><p><strong>Design Patterns (The "What"):</strong> These are conceptual, holistic, and technology-agnostic. They define "what do we need to do, how should work, what are sort of conceptual boxes that we tick." Examples include:</p></li><li><p><strong>Historizing Data:</strong> The need to capture and store every historical view or instantiation of data over time. This is considered mandatory: "you always want to bring in a historicization pattern into your platform on day one... because you are gonna need it."</p></li><li><p><strong>By-temporality:</strong> The complex concept of managing both "assertion time" (technical timestamp) and "state timeline" (business validity). 
Roelant states, "you have to have these two timelines in place at all times and solve the problems associated with it, like the bytemporal."</p></li><li><p><strong>Reference Data Management:</strong> The ability to capture and iterate lookup data not originating from source systems (e.g., Excel, Google Sheets).</p></li><li><p><strong>File Drop Capabilities:</strong> The ability to ingest ad-hoc files (CSV, JSON, XML, Excel) into the platform.</p></li><li><p><strong>Solution Patterns (The "How"):</strong> These are specific implementations of design patterns on a given technology stack. They involve the concrete choices of how a design pattern will be realised.</p></li><li><p><strong>Examples:</strong> For historizing data, solution patterns could be SCD Type 2 dimensions, Data Vault satellites, or specific methods in DBT, Oracle Warehouse Builder, SQL, Python, etc.</p></li><li><p><strong>Technology Agnostic Design:</strong> The goal is that the core design pattern (e.g., historization) remains constant, but the "physical modeling can change depending on the technology, the user, the tools, all those constraints." The "technology becomes less of a... Yes, you need to optimize it, but it's also not really where the IP resides."</p></li><li><p><strong>Mandatory Design Patterns:</strong> Roelant argues that certain design patterns are "mandatory" because "to truly work with data in your organization, there's no avoiding them. Sooner or later, you will run into these problems, so you might as well tackle it upfront."</p></li><li><p><strong>Bulletproofing and Reusability:</strong> The aim is to create "bulletproof" and "reusable" code for common solution patterns, similar to how Data Vault hub/link/satellite loading code can be hardened. This reduces manual effort and increases reliability.</p></li></ul><h3>4. <strong>The "Engine" Concept and Future State</strong></h3><p>The book envisions a highly automated "Data Engine" that can intelligently manage and optimise data solutions.</p><ul><li><p><strong>Optimizer Component:</strong> A key component of the "engine" is an "optimizer." This allows the system to "calculate based on use patterns, what the best combination of physical, virtual objects are to deliver that" based on directives like "cost storage, IO latency." This means dynamically choosing whether to virtualize or physicalize data based on specific performance or cost requirements.</p></li><li><p><strong>Config-Driven Platforms:</strong> The ideal future state is a fully configuration-driven, end-to-end platform where users can "pull up the design patterns, you tick some boxes for them, pull up the constraints around which technologies it's allowed to use, and it actually writes the solution patterns and the code and deploys itself."</p></li><li><p><strong>Virtualization as a Test:</strong> Roelant suggests that virtualization, even if not physically implemented, serves as a test of the robustness of patterns: "If you can do it [virtualize], you can also physicalize... but if you can't virtualize it. Then something's wrong with patterns."</p></li></ul><h3>5. <strong>Narrative Approach and Practical Application</strong></h3><p>The authors employ a specific narrative strategy to make the complex concepts more accessible and actionable.</p><ul><li><p><strong>Fictitious Company ("FCOM"):</strong> The book uses a fictitious company to embed theory within a story. 
This "takes that emotional aspect and put it into the book storyline so we can have that conversation about pros and cons and what it means." This approach echoes books like "The Goal" and "The Phoenix Project," where patterns are revealed through a narrative.</p></li><li><p><strong>Practical Implementation:</strong> The book is designed to be highly practical, allowing readers to "implement a fully automated solution, uh, themselves."</p></li><li><p><strong>GitHub Repository:</strong> Complementing the book, a GitHub repository will be launched with "all these code examples and patterns and things you can run yourself, and templates and everything." This aims to foster collaboration and provide concrete implementation examples.</p></li></ul><h3>6. <strong>Challenges and Learnings in Writing the Book</strong></h3><p>The podcast touches upon the significant effort involved in synthesising 25+ years of experience into a coherent framework.</p><ul><li><p><strong>Seven Years in the Making:</strong> The book has taken "almost seven years" to write, highlighting the complexity of the undertaking.</p></li><li><p><strong>Collaborative and Harmonious Process:</strong> Despite co-authorship, Roelant notes remarkable agreement between himself and Dirk, leading to a "very harmonious" writing process. Disagreements were rare; instead, there were moments of "didn't know something yet" leading to further exploration and coding.</p></li><li><p><strong>Valuable Role of Training:</strong> Creating course material and training people <em>before</em> writing the book proved invaluable. Roelant states, "at some point I started recording the trainings for that purpose. And then after the training sessions, I was updating the notes in the slide decks to find the right words. And that all made it into the book." This iterative process of teaching, getting feedback, and refining content directly informed the book's clarity and structure.</p></li></ul><h3>7. <strong>Antipatterns and Dogmatism</strong></h3><p>The discussion subtly highlights issues in current data practices that "Data Engine Thinking" aims to address.</p><ul><li><p><strong>Handcrafting vs. Automation:</strong> A major antipattern identified is the tendency for data professionals to "like to handcraft things oddly." This is contrasted with the software domain's adoption of templated, automated deployments (e.g., Terraform).</p></li><li><p><strong>Reinventing the Wheel:</strong> When changing technology stacks (e.g., SQL Server to Snowflake to Databricks), data teams often "reinvent all the patterns again... That's just crazy." The book seeks to provide a shared, reusable library to combat this waste.</p></li><li><p><strong>Avoidance of Complexity:</strong> Data professionals can be "a bit shy of exposing too much complexity if they can avoid it," opting for "simpler patterns" that may cause issues later. The book argues for upfront embrace of necessary complexity, managed through automation.</p></li><li><p><strong>Dogmatic Approaches:</strong> The shift away from rigid adherence to single methodologies (e.g., Inmon vs. Kimball, Data Vault vs. Dimensional) is acknowledged, promoting flexibility based on context.</p></li></ul><h3><strong>Conclusion:</strong></h3><p>"Data Engine Thinking" proposes a paradigm shift in data solution development, moving away from manual, ad-hoc, and project-specific builds towards automated, pattern-based, and inherently adaptable systems. 
By clearly defining "design patterns" (the what) and "solution patterns" (the how), and advocating for their codification and reusability, the book aims to lower the cost of change, increase agility, and ultimately build more trustworthy and future-proof data platforms. The authors' extensive experience and practical, code-backed approach suggest a significant contribution to standardising and industrialising data engineering practices.</p><p></p><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><p></p><h2>Transcript</h2><p><strong>Shane</strong>: Welcome to the Agile Data Podcast. I&#8217;m Shane Gibson.</p><p><strong>Roelant</strong>: Hey, I&#8217;m Roelant Vos. Thanks for having me.</p><p><strong>Shane</strong>: Hey, Roelant. Thank you for coming on the show Today. We are going to talk about the patterns that are in your book, data Engine Thinking, but before we do that, why don&#8217;t you give the audience a bit of background about yourself.</p><p><strong>Roelant</strong>: Yeah, sure. Thanks Shane. I&#8217;m fu, I&#8217;m a Dutch national, but I&#8217;ve been living in Australia for almost 17 years now, so I&#8217;ve got this weird blend of accents and I&#8217;ll try to not bother too much with it, but I&#8217;ve been working in the data space for more than 25 years, so pretty technical guy. I&#8217;ve started with computer science and then got my first job as a Conos consultant, so building reports and cubes and everything, and quickly realized that.</p><p>Building these reports is that&#8217;s good fun and it&#8217;s important, and obviously it&#8217;s what people often consume their data with, but the issues that you find in the data, you can&#8217;t easily or should easily fix them in the reporting sites. Slowly but surely descending into the depths of data management and ETL data integration all the way down to database administration, more the real hardcore technical low level stuff.</p><p>And then slowly my way back into light, took a management position at Alliance for a long time. And now I&#8217;m back to more the technical programming area of things. &#8217;cause there is this pendulum always I find, right? So as a data professional, you see these problems in organizations and you wanna fix it, make it better, make the data more easy to consume and to an extent you can with your tools and skills and all this stuff.</p><p>But problems are always. Often, partly organizational as well. So then you find yourself going into more management and trying to change your organization to work better with data, but you lose the technical aspect of it a little bit. So it is always this back and forth between I wanna do something to improve the organization and also wanna keep up to date with the technical skills.</p><p>So that&#8217;s been my story all along,</p><p><strong>Shane</strong>: and it&#8217;d be fair to say that. Over those 25 years, you&#8217;re focused a lot on automation. 
Before the term DataOps came out, I always remember a lot of your writing and thinking was around how to automate some of the tasks so that you don&#8217;t have to do them manually.</p><p><strong>Roelant</strong>: Absolutely. Yep. So I started that in 2000 actually, with Tcl, which was part of Oracle Warehouse Builder at the time. Working with Oracle Warehouse Builder, may it rest in peace, we used this Tcl language to try to automate things, and we came up with ways to change the physical modeling to align better with automation, because refactoring takes away speed.</p><p>That&#8217;s also when we ran into more hybrid modeling techniques such as Data Vault and things like that, because that was already more mature than what we were doing, but it still aligns with the same mindset of modular patterns that are easier to automate, quote unquote. So that whole approach towards making life easier with automation started out as &#8220;we can do this and it also saves a bit of time&#8221;, but it evolved into this mindset that working with data is based on a series of assumptions,</p><p>which are translated into a model, an interpretation of the world, right? That&#8217;s literally what the model is. And those interpretations are always going to be flawed, because you start with a certain background where you may not have the full awareness or understanding of everything that is in place and the history of why things are now where they are.</p><p>So your interpretation of reality, your model, is always a concept, and you keep exchanging your interpretations. The more you understand the data, the organization, the history, the knowledge that sits in people&#8217;s minds and memory, the more you learn, the more you want to refine that model. And then you have to go back every time to update your code base.</p><p>And that&#8217;s why automation is such a critical thing. So automation and model driven engineering, pattern based design, those kinds of terms have always been the red thread through the work. Yeah.</p><p><strong>Shane</strong>: And for me, it&#8217;s this idea that if the cost of change is lower, then we are more willing to change. Hundred percent.</p><p>We can manage more, change more often. We can iterate more often. We can take more risks. We can make earlier guesses, because we know the cost of refining that guess in the future is lower than if everything was manual, everything&#8217;s hard coded, everything involves human effort that&#8217;s large. We just won&#8217;t take the risk, because we want to do a lot of design up front,</p><p>because we don&#8217;t want to invoke that high cost of change. I&#8217;m with you that the ability to automate it gives us the ability to experiment.</p><p><strong>Roelant</strong>: Absolutely. And then that links into things like data warehouse automation and the tooling that exists and the patterns that exist for it. So then you think about what kind of components do we need?</p><p>How can we really specifically define them so they are reusable and they tick all the right boxes, and I&#8217;ll get to what those boxes are in a bit. But my experience with the tooling back then was that it always felt like things were made to look too easy, because certain things were missing. And at the same time, it shouldn&#8217;t be that hard, because once you figure it out and you embed it once, then you can use it as many times as you want, right? 
So there&#8217;s this feeling that was always lingering, that the patterns we were using weren&#8217;t as correct as they should be.</p><p>If that pattern, that implementation approach, was generic and truly ticked all the boxes, then, you know, we could park it and leave it behind, because once we figure it out once and for all, then we&#8217;re done. But with all these different tools and different frameworks that many people use, there was always some issue that sort of made it incompatible.</p><p>And the way I look at it, these systems, these data solutions as I call them, they&#8217;re not complicated, but they&#8217;re complex, right? There&#8217;s a lot of tiny moving bits and pieces.</p><p>So we talk through, end to end, what it means to set up an architecture that is designed for change, what these components are that need to work together, what kind of frameworks you need to think about, and the message is that there are correct, viable pathways to implement things. But you need to understand that if you change one of the components on the left hand side, then it has downstream impacts on the components on the right hand side.</p><p>So all these components need to be aligned, need to be in sync with each other, so the correct function results. And that&#8217;s really what it is. So there are a couple of options, but you need to understand what the consequences are of those options, make a decision that you then capture in your design, and then you move on.</p><p>Working through that allows us to then build code generation templates and automate everything from there.</p><p><strong>Shane</strong>: Cool. So what is the book about?</p><p><strong>Roelant</strong>: So Data Engine Thinking is an end-to-end overview of what it means to create a data solution that is designed for change. So what frameworks do we need? What kind of steps do we need to think about?</p><p>What kind of decisions do we need to make along the way? And it&#8217;s split into two general sort of storylines. One is the theory: how do we deal with key distribution, bi-temporal data, late arriving changes, how do we manage time, how do we manage information models versus physical models? All that stuff, right?</p><p>So that&#8217;s the bulk of it. And then you&#8217;ve got a fictitious company threaded through it all. There&#8217;s a couple of reasons for that. One is that we feel like some of the implementation choices that you have are really opinionated, and in some cases that&#8217;s rightfully so. In some cases, we try to make the point, it doesn&#8217;t really matter, but we try to take that emotional aspect and put it into the book storyline, so we can have that conversation about pros and cons, what it means, and how you get to a decision, outside of the standards, if you will.</p><p>So it&#8217;s not like &#8220;you have to do this&#8221;, but &#8220;these are your options, this is what we would recommend&#8221;. And then we keep the emotional aspect out of it, and at the same time it follows the typical discussions and dynamics that you have in a given company, right? So how do you get to an implementation of a new data warehouse or data solution or anything?</p><p>So it&#8217;s that first thing: let&#8217;s define the vision. What are the current problems? How do we have the conversation about &#8220;let&#8217;s buy a tool&#8221; versus &#8220;we need to spend time on information modeling&#8221;? How do we translate from the information model to a physical implementation? 
What kind of technology stack or methodology should we select?</p><p>And then all the way to: how do we introduce automation, why does it matter, how do we implement testing, how do we roll it out into DevOps, all the way to the end result.</p><p><strong>Shane</strong>: Okay, so did I hear that you picked up the pattern for the book from The Goal and The Phoenix Project, where you tell a story to reinforce some of the messaging, rather than having somebody have to guess what the patterns are?</p><p>&#8217;Cause if you go for The Phoenix Project and you read it, after a while you go, oh holy shit, this is about DevOps, I can start seeing it. But it comes as a bit of a shock, right, if you haven&#8217;t read it before.</p><p><strong>Roelant</strong>: Yeah. Yeah.</p><p><strong>Shane</strong>: So I think what you&#8217;re saying is you&#8217;re kind of calling out the patterns, and then using a quasi real life example story to give a reinforcement about how you decide between the options.</p><p>&#8217;Cause they&#8217;re all good options, but given the situation, certain patterns are more valuable given the context and others become anti-patterns. So you&#8217;re combining both describing the pattern and then telling a story to give us a reinforcing model of how you choose and how you use. Is that right? Yep,</p><p><strong>Roelant</strong>: that&#8217;s right.</p><p>Yeah. And that goes hand in hand all the way through the book. We hope that people can use this to implement a fully automated solution themselves; it should be really practical.</p><p><strong>Shane</strong>: Okay. And then the other thing is, I tend to talk about information value stream and information factory, and I&#8217;ll give you the difference.</p><p>So information value stream is how a team works out how they&#8217;re gonna identify a problem in the organization, how they&#8217;re gonna ideate the possible choices they have for solving that problem, how they&#8217;re gonna discover which option looks like it&#8217;s the most viable, feasible, valuable, and how they&#8217;re then gonna get the work prioritized so it can go into their delivery queue. So it&#8217;s more product thinking on the left hand side, I&#8217;m using my hands here. And then on the right hand side, once it&#8217;s been prioritized, they pick it up and they do some light design, they do some building, some deployment. They release that value back to the stakeholders, they may maintain or decommission over time, and then they go back, and it&#8217;s like a continuum.</p><p>Again, they&#8217;re aiming for speed through that, so they can find a problem and deliver back quickly. Yep. An information factory is more about the way we build the platforms that support that work, and the way we move data through it all the way from collection through to consumption, and where we can look at automation, where we can use platforms and tools to automate work that has value when it&#8217;s automated, and try and turn that work into as much of a factory as possible. Still realizing that it&#8217;s half art, half science, &#8217;cause no data ever survives our first engagement. So using those two models, I get the feeling that Data Engine Thinking is about the 
information factory: how you make decisions around the vision for the platform, the components you need to build for the platform, the way you make decisions on the trade-offs that you have on every component, how you automate it, how you deploy it, all that engineering craft and practice.</p><p>Is that right?</p><p><strong>Roelant</strong>: That&#8217;s spot on, right? Definitely more in the information factory area. So as part of the modeling and, I guess, the information gathering and the interactions with the quote unquote business side of the company, the interface with that is the information model. So we use a couple of examples, but stick to FCOM mostly, where we say you need to have an understandable way of communicating with everybody,</p><p>in a way that is unambiguous. We feel like conceptual models and concepts in general are really suited for that, but we don&#8217;t talk about how you workshop that. But then you feed the results of that into your engine, and then everything will refactor automatically and look the same way, or however you want it. Not the same way, but yeah, the result of that decision.</p><p><strong>Shane</strong>: Okay. And so when you talk about the information model and the pattern that you&#8217;re adopting, are you talking about a business model or conceptual model of the data for the organization?</p><p><strong>Roelant</strong>: Yep, a hundred percent. So we make the case that that is the true IP. And you want to use things that are available in the market at the moment for that, but the physical side follows pretty much automatically.</p><p>So that&#8217;s also a pattern that you can decide on and codify and then run. So definitely the conceptual model is the input, and then everything downstream will be generated.</p><p><strong>Shane</strong>: And then often I talk about hydration. So I talk about, if you get your conceptual model right, and then you understand the attributes, which is a form of logical modeling, then you should be able to hydrate your physical model using any modeling pattern you want.</p><p>You should be able to deploy it as anchor or Data Vault or dimensional or relational, because actually, like you said, it&#8217;s that conceptual understanding of the data that&#8217;s more important. And then the physical modeling can change depending on the technology, the user, the tools, all those constraints that you have that make one physical modeling pattern seem better for you in that situation versus another.</p><p>Versus the old days where we were dogmatic, and we were Data Vault for everything, or dimensional. Not quite Inmon versus Kimball, but definitely Data Vault versus dimensional was a key scrap a few years ago, from memory.</p><p><strong>Roelant</strong>: Hundred percent. In that context, we split the term pattern into a design pattern and a solution pattern.</p><p>So the design pattern is more the what: what do we need to do, how should it work, what are the sort of conceptual boxes that we tick. The solution pattern is the specific implementation on a given technology. So the design pattern itself is holistic, generic, technology agnostic, all that stuff, and that includes translating the logical model to the physical model, for example.</p><p>So you have a specific idea of how this should work in technology X, and then you can get it switched over into something else when you need to, so the technology becomes less of a thing. Yes, you need to optimize it, but it&#8217;s also not really where the IP resides.</p>
<p><strong>Shane</strong>: Okay. And so if I play that back, because I was thinking about the complexity of the path.</p><p>So I can go and say there's a design pattern of historize data. So where data changes in the source system, we want to be able to see every historical instantiation of that data over time. So that's like a big pattern in my head; is that what you call a design pattern? And then I can go, if I deploy an SCD type two dimension, that gives me a way of storing historized data.</p><p>Or I could do a satellite off a Data Vault hub that gives me the ability to store historization. I can store change records in a single table. There's a whole lot of technical patterns for the physical modeling. Then I could do that using dbt, or I could do that using Oracle Warehouse Builder if I still had it. I could do it in SQL code, I could do it in Python, heaven forbid I could do it in Java or Scala if I'm back on the Hadoop stack. And then actually I could physically store it in some form of Databricks, some form of Fabric, some form of Snowflake. Or, as we always do, all of them. So they're just solution patterns.</p><p>They're just different ways. But people often struggle with the complexity, 'cause we're almost in an OLAP cube now, which is, hey, I've got this one simple design pattern, historized data. And sorry, I didn't bring in my data layers, so I could be persistent staging, EDW, presentation, or I could be bronze, silver, gold.</p><p>I have a whole lot of different language just for the layers. So do you cover the way you should think about it when you have all these complexities, to take it from a design pattern to a solution pattern?</p><p><strong>Roelant</strong>: Yeah, and it's actually one of the more difficult things we found to explain and to write, and we ended up moving text and content around quite a bit to try to explain that the best way we can.</p><p>So first of all, the solution architecture can partly be seen as the collection of all design patterns. So you've got specific patterns that apply to specific architectures, is another way of looking at it. So we start saying, these are the high-level architectures you can think about, right? The high-level building blocks.</p><p>PSA was mentioned, presentation was mentioned, and we say that at that level you need to make a call and include one of these design patterns in your solution design, because that will have a lot of downstream impact. And we try to explain there, if you don't do this, then you need to think about this. And if you do this, then you don't do this.</p><p>But if you do that, then... so it's almost like a decision tree of options that you need to think about. And that's a critical first part, because once that is in place, it changes the way the next layer or downstream processes will work. For example, if you don't have a PSA, your refactoring, your reloading of historical data into a different model, will look different, right?</p><p>You can still do it. It'll just be different. So the complexity is shifting that way a little bit. If you use an insert-only pattern, or also allow updates in the data, it shifts the way you interpret the timelines when you deliver data for information. So we need to take a couple of high-level decisions and understand what the implications are.</p>
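<p>As a rough sketch of that insert-only trade-off, assuming a tiny in-memory example rather than a real warehouse: in an insert-only satellite or dimension the end date is not stored, it is derived from the next version's load date at query or delivery time.</p><pre><code># Illustrative sketch only: deriving end dates from an insert-only history,
# the Python equivalent of a LEAD() window function over the load timeline.
from datetime import datetime

versions = [  # insert-only history for one business key
    {"key": "cust-1", "value": "Bronze", "loaded_at": datetime(2024, 1, 1)},
    {"key": "cust-1", "value": "Gold",   "loaded_at": datetime(2024, 6, 1)},
]

def with_derived_end_dates(rows):
    """Each version ends where the next one begins; the latest stays open."""
    ordered = sorted(rows, key=lambda r: r["loaded_at"])
    paired = zip(ordered, ordered[1:] + [None])
    return [{**cur, "end_at": nxt["loaded_at"] if nxt else None} for cur, nxt in paired]

for row in with_derived_end_dates(versions):
    end = row["end_at"].date() if row["end_at"] else "open"
    print(row["value"], row["loaded_at"].date(), "to", end)
</code></pre>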
<p><strong>Shane</strong>: And if I think about it, the way we've dealt with this in the past is we have blueprints for reference architectures. So if you look at any experienced data architect that's both a data architect and a data platform architect, they effectively have a small number of blueprints in their head that have been crafted for certain styles of organization: large versus small, distributed versus not, complex data, fast-moving data, big data where we talk about terabytes, and then a whole lot of other patterns. And they then bucket those into blueprints, templates that they bring first, and then they iterate for the things that don't fit the context of that organization. I think what you're telling me is you've taken that up a level to say, actually, here's the building blocks of those blueprints and reference architectures, here's the choices that you need to make, that will tell you which block to adopt, and the consequences, later or earlier in your information factory, that decision will have.</p><p><strong>Roelant</strong>: Yeah, exactly. That's a good way to put it at that level.</p><p>And what we also feel is that once we define that reference architecture, we do select one, right? So we've got 12 different blueprints, if you will, and then one that we say, this is our reference, we'll just continue with that, but keep the others in mind and we'll refer to them throughout the book.</p><p>But we also feel that this complexity is something that can be understood and managed and encoded, codified once, to the point that we then don't have to worry about it as much. And that's sometimes a controversial topic, right? So basically what we're saying is it doesn't matter who you are, if you use that approach it does everything you need, and we're just making sure that one solution will always work. And again, we're not talking about the technical level, the solution pattern, because the infrastructure and technology will be slightly different from case to case, but the architecture, the design of the platform itself, should be the same for everyone.</p><p><strong>Shane</strong>: So let me ask a couple of questions to clarify what you've just said, so I can interpret it. There's what I call an anti-pattern, which is: whenever you're modeling, party-role is the only table you need, because it covers every use case for every organization, and I'm not personally a great fan of that pattern. Or I can infer it as you saying: historize data by default, because you're gonna need it. And therefore, if you don't do it, you're gonna have to do it at another stage. It always has to be done. Just decide which pattern you're gonna use to do it, and here's your choices. So design block: we do need to historize data. Now we just need to figure out which solution design we're gonna use to create the historization.</p><p><strong>Roelant</strong>: Yeah, pretty much. Yeah, it's definitely the second. So the party-role thing, that's more the modeling side, and there's actually a view on that and how to apply it, but the point I'm trying to make is that for a given architecture to do what we want it to do, which is being endlessly refactorable and all that stuff, right?
So you can morph your model and move with the organization, all that good stuff.</p><p>You have to have a couple of things. You have to have a historization point. You have to specify what we call an assertion timeline and a state timeline, like a technical one and a business equivalent, right? Because you have to be able to handle late-arriving transactions, future-dated data, that kind of stuff. So one of these blocks is to have these two timelines in place at all times and solve the problems associated with them, like the bi-temporal stuff. These are complex things.</p><p>We argue you have to have all of them: historization, bi-temporality, a way of delivering information. You can make some decisions here and there, but there's a couple of these core things to make it work. Business keys, defined at the right level. If you don't have that, then it'll not work. But we also argue, to truly work with data in your organization, there's no avoiding them.</p><p>Right? Sooner or later you will run into these problems, so you might as well tackle them upfront. Embrace that complexity, make sure it's encoded and codified once, and once it's around you don't have to worry about it as much, because at that point it becomes pattern based, right? Because you know that you need two timelines.</p><p>You know that the way to fix timeline gaps or issues in dates and metadata and stuff is the same stuff over and over again. The pattern for delivering data is gonna be the same no matter what you do. So these are complex things by themselves.</p><p><strong>Shane</strong>: I'm gonna drill down on that in a second. I just wanna go back and get a couple more examples, 'cause bi-temporality is quite a deep thing to understand as a data person. So if I go back to a couple of examples that I think are a little bit simpler to understand, that I always see in a data platform. So one is reference data: the ability to have a bunch of data with some lookups where it's not created in a source system. It's in Excel, Google Sheets, or in a Word document, and we need to be able to capture that data and iterate it in the data platform. So I see that one a lot. The ability to file drop, the ability to grab a CSV file, a JSON file, an XML file or, heaven forbid, an Excel file, and manually drop it into the platform for an ad hoc task, is one I see all the time.</p><p>Okay. So those are design pattern blocks. They're things that you don't have to build on day one, but you're gonna build them. If your team and your data platform survive for a couple of years, you will create those features, 'cause everybody wants them.</p><p><strong>Roelant</strong>: And if you do that, then make sure that these boxes are ticked, because it needs to adhere to the fundamental principles. Everything else has to.</p><p><strong>Shane</strong>: Okay. And then if we look at an example of SCD two and hubs, sats and links. So the value of those patterns is you can write SCD two code and it's the same code. It is a known solution pattern. You write that code, you can harden that code, and if you want to create an SCD two type dimension, that code becomes bulletproof.</p><p>It's the same, and one of the reasons I like Vault. Yep. Because you can create the code that says create hub, create link, create sat, load hub, load link, load sat. You do all the exception ones, of course, but those six pieces of code become bulletproof. They become reusable. They are used so often that actually every bug and every use case gets solved, so you can just trust that code.</p>
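<p>As a hedged illustration of that "write it once, harden it, reuse it" idea, a metadata-driven hub loader might look roughly like the sketch below. The SQL dialect, table names and column names are assumptions for the example, not code from the book or from any specific tool.</p><pre><code># Illustrative sketch of a metadata-driven "load hub" template (Data Vault style).
# Dialect and naming conventions are assumptions, not from the book.
HUB_LOAD_TEMPLATE = """
INSERT INTO {hub_table} (hub_key, business_key, load_dts, record_source)
SELECT DISTINCT
    MD5(stg.{business_key})  AS hub_key,
    stg.{business_key}       AS business_key,
    CURRENT_TIMESTAMP        AS load_dts,
    '{record_source}'        AS record_source
FROM {staging_table} AS stg
WHERE NOT EXISTS (
    SELECT 1 FROM {hub_table} AS h
    WHERE h.business_key = stg.{business_key}
);
""".strip()

def render_hub_load(hub_table, staging_table, business_key, record_source):
    """Render the same hardened pattern for any hub, driven purely by metadata."""
    return HUB_LOAD_TEMPLATE.format(
        hub_table=hub_table,
        staging_table=staging_table,
        business_key=business_key,
        record_source=record_source,
    )

print(render_hub_load("hub_customer", "stg_crm_customer", "customer_id", "crm"))
</code></pre><p>The point is not the SQL itself but that the template is written and hardened once, then driven by metadata for every hub.</p>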
<p>And I think what you're saying is we take that and apply it to the design patterns and say, okay, let's come up with solution patterns that get used time and time again. Then we're effectively iterating those solution patterns, giving them more test rounds, more coverage, so they become bulletproof and they become easier to implement, and therefore why wouldn't we do historization, if we could just drop a solution pattern in and know it's gonna work?</p><p>We don't have to build it ourselves from scratch. Is that what you're saying?</p><p><strong>Roelant</strong>: Absolutely. And extend that to more complicated things like the bi-temporality you mentioned.</p><p><strong>Shane</strong>: Okay. And so when you get these complex use cases, then what you're looking for is a proven solution pattern, so you don't have to invent it yourself, and you're effectively dropping in that Lego block and the problem's taken away, to a degree.</p><p>And that's controversial, right? Yeah, it's codified, but I dunno why it's controversial. So let's discuss that, because in the DevOps world you create Terraform templates to deploy your servers as cattle. You protect the Terraform code like nothing else. You deploy your servers, you throw a server away, you redeploy it. It's a proven pattern. In data, we seem to like to handcraft things,</p><p><strong>Roelant</strong>: oddly. There isn't yet a set of patterns that has been described sufficiently to tick all the boxes and actually work in all scenarios. So what works for one methodology doesn't work for some other one, or there are a couple of gaps, or a couple of shortcuts are taken to avoid some of the nasty things that may happen.</p><p>And then you run into them and then you have to make changes. So it's not matured enough yet. It's almost there, but not quite yet. So that's also a reason why there's a lot of handcrafting. Also, data professionals seem to be a bit shy of exposing too much complexity if they can avoid it. So they would opt for simpler patterns if those tick the box for a little while longer, and then not worry about the things that will go wrong in the next six months or so.</p><p>So, ah, this'll make it too complicated, because what will people think? And I sense that quite a bit and I'm like, yeah, look, satellites are a great example. The Data Vault satellites, right? So yes, historization, that's easy, but we argue you need to set up your satellites for bi-temporality always, basically, because if you don't, you'll run into the problem and then you have to go back and reinvent the wheel and change your patterns, and there's no proper tool that fully supports that.</p><p>And you go into hand coding and it gets all kinds of weird. So it's that combination of not everything having been fully thought through yet, in my opinion, combined with, let's be a bit careful with how complex we wanna make things, because what will people think is necessary?</p><p><strong>Shane</strong>: So yeah, we've talked about this on and off over the years, and I've struggled to find a template or a pattern to describe patterns.</p><p>So it's one of those things where I can write a name, I can write a description, I can provide some context about where it has value or not. 
I can provide one example, but then it just starts getting hard, and for somebody to be able to pick up a pattern and read it and actually understand how to apply it is something that I've always struggled with.</p><p>With your design patterns, how far down that path did you get with the book?</p><p><strong>Roelant</strong>: We have defined two templates, one for each, that we follow throughout the book. By the way, when we launch the book, we'll also launch a GitHub repo which has all these code examples and patterns and things you can run yourself, and templates and everything.</p><p>So everything we've used in the past, obviously in our projects, and then added to the book, will be publicly available, also with a view on collaborating on it further. So these templates will be there, and they have a couple of things, but having a template with a couple of headers is never gonna fully cut it, because you need more of a dynamic, interactive sort of thing.</p><p>So the way I look at it is that the sections that we've defined in the book will go through all the options and considerations as part of that pattern, to use the word again: error and exception handling, reference data. So we say, this is what you need to think about. And then at these points we say, in your template, include this and explain there why you are doing this.</p><p>Why am I excluding or including this? Why am I making this decision? So we've basically provided this checklist of stuff that you then need to include in the template, and that will be part of your solution design. Acknowledging that it's almost like a graph model, right, that ties these things together, so also these pattern definitions. But we think that's a good start, because it elevates it to a higher-level concept and then breaks it up into smaller pieces that you can include or exclude.</p><p><strong>Shane</strong>: Okay. So it becomes a pattern library effectively, and then some golden paths or guided paths of the way you would typically put the patterns together. Because there are, like you said, a common set that we all do, and then there are the exceptions that we end up doing because the organization's context for some reason is so different to everybody else that we go, ah, actually we have to deal with that one differently here, for this reason. But we only deal with that as an exception, because, I dunno about you, but the number of times I've worked with an organization that said they're special and unique, and I'm sitting there going, uh, I've seen this before a few times, I'm not quite sure you are special and unique. Still special, just not unique, right?</p><p>Yeah, yeah. True. And then we end up spending six to twelve months building something they never use, because they said they wanted it and they never actually use it.</p><p><strong>Roelant</strong>: Yeah. And the second part is always an interesting thing: the more you build, the more time, the more maintenance, all that stuff. So again, automation is key to take that concern away, right? 'Cause we feel you need that complexity, but we also understand you don't wanna build, like, the Jaguar when you only need a Mazda 2 or whatever to get where you need to be initially, but then at least take a couple of options to prepare you for the future. That automation is essential there.</p>
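<p>As a rough, invented illustration of that kind of pattern template and checklist (the field names here are assumptions, not the book's actual template), a pattern entry could be captured as structured data and checked for completeness before it goes into a solution design:</p><pre><code># Hypothetical pattern template with a completeness check; field names are invented.
REQUIRED_SECTIONS = ["name", "problem", "context", "decisions", "related_patterns"]

historization_pattern = {
    "name": "Historize data",
    "problem": "Source systems overwrite state; we need every historical version.",
    "context": "Almost every analytical platform; pick the solution pattern per technology.",
    "decisions": [
        "Insert-only, or allow updates?",
        "Which timelines are kept (assertion vs state)?",
    ],
    "related_patterns": ["Bi-temporality", "Persistent staging area"],
    "solution_patterns": ["SCD type 2 dimension", "Insert-only Data Vault satellite"],
}

def missing_sections(pattern):
    """Return the checklist items that still need to be filled in."""
    return [s for s in REQUIRED_SECTIONS if not pattern.get(s)]

gaps = missing_sections(historization_pattern)
print("Pattern is complete" if not gaps else f"Still to document: {gaps}")
</code></pre>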
<p>So once you have that foundation in order, then you can build on it.</p><p>And also the notion of, I guess, the complexity that we've attributed. There's this ideal that I have that I've been talking about for years and years and years, and it's also one of the red threads in the book, even though I don't really make it that explicit. It's the virtualization idea, right? So my dream is still to have this virtual representation of data that doesn't require persisting data through all these layers and areas and all the complexity. You just have the data you need and you historize it in the most fundamental way.</p><p>But after that, all the things that you put on top of that are derived the same way from the information model as you would traditionally do by creating a physical model, creating sort of data logistics like quote-unquote ETL processes and all that data movement, right? So all that stuff can be replaced with the virtual representation.</p><p>Funny thing is, it still needs to tick the same boxes: you still need bi-temporality, historization, reference data, blah blah, all the things we've talked about. But it's such a powerful idea, and I think in the book we try to always come back to it and say, if you do this, think about how it would behave in this virtual world.</p><p>Can we still do this? Can I still go to one version of the data and then to another version of the data, back to the original version of the data, and make it look the exact same way, right? Deterministic processing. If the answer is yes, then you know your solution works. If the answer is no, you're gonna get in trouble.</p><p>Similarly for things like, do I select an insert-only approach, or do I support updates in my data? Right? So if you do an SCD two, as in your example earlier, do I update the end dates? Can I derive that? Why is it necessary? Does it matter? How do we do things like driving-key link satellites in an insert-only approach? How do we do bi-temporality in an insert-only approach? The funny thing is, you can do all that. There's no limitation on doing that, but it'll have some impact on your consumption of data. It's gonna work. So that kind of idea is something that continues to fascinate me.</p><p><strong>Shane</strong>: I'm the same. We've experimented with fully virtualized in our platform as a core concept.</p><p>And yeah, we always end up regressing back to physicalization of data at some point, because we had a constraint either in cost or performance, but more often we had a constraint of physicalization in our thinking. When we bring in the next pattern and it's all virtualized, we lose the ability to trace the physical data flow to understand whether we've got that pattern applied properly or not.</p><p>And I'll give you an example. It's that graph problem. There's just too many things going on for our little brains. So for example, we run a form of Data Vault modeling for our physical structure, but not for our conceptual structures, and we treat link tables as event tables effectively.</p><p>So we can say a customer ordered a product from this store and this employee, as an event. And we hydrate our consume layer so you can see all the data de-normalized into one big table. 
So for that event, you can see the customer and all their details, the order and all their details, the product and all their details, the store and all its details, and the employee, right, in one table.</p><p>So that first cut was easy. And then you go, okay, now we wanna make sure that when we hydrate that, we are picking up the right version of the data that existed at the point in time that event happened. Okay, now we've gotta go and take our hubs and sats, or our dims, and we've gotta find the right point in time to hydrate it so that it looks like it did at the time that event happened.</p><p>And again, we can physicalize that and we can then virtualize it. And then we go to our ETL, our transformation code, which we call rules: the code or the logic that's applying some change to that data. Again, we can virtualize it. But now, if we wanted to say, actually, we're gonna virtualize it, we've gotta say, this event happened at this point in time, this was the raw data for each of those core concepts at that point in time, and this was the business rule that was being applied at exactly that point in time. And then switch it out. With reasonable volumes of data the platform could probably handle it, but we can't see that graph. It becomes so complex that to test every case and make sure the pattern's hardened becomes so intensive that we go, is it worth it right now? And what you are saying is, if that pattern was available to us and we could just apply it, then hell yeah, we would. Because we know it's a proven pattern, we can trust it's already been tested.</p><p>We don't have to craft it ourselves, and then we can trust it to be virtualized. So I think that's part of it: when you have to get that graph in your head, every piece of complexity you add, 'cause it does become complex, makes it harder to prove that your code is doing what it should for every use case. That's the problem that we need to solve.</p><p><strong>Roelant</strong>: Yeah, absolutely. And there's so much to unpack there, right? To start, I always say we're in the business of trust. We are delivering the data, so we have to prove it's correct. And by correct, I don't mean according to the expectations of the business, but we need to be able to audit back to the original data at any point in time, because if we don't, we lose that trust, and it's not easy to get back. We lose all of that.</p><p>So having that lineage is essential, and it ties into versioning: we have to version control the metadata and the code generation templates hand in hand with the data, which is built in.</p><p>There are too many practical barriers at the moment still to make full virtualization realistic, but I see it as a litmus test to make sure that the patterns that you do have are working as intended. So if you can do it, you can also physicalize and/or materialize and do all these other things, but if you can't virtualize it, then something's wrong with your patterns. So it's more like a test to make sure everything is correct.</p>
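<p>A small sketch of the point-in-time hydration problem Shane describes, using plain Python and invented data rather than a real consume layer: for each event, pick the version of each related concept that was current when the event happened.</p><pre><code># Illustrative only: hydrate an event with the concept versions that were
# current at the event timestamp (a point-in-time join in miniature).
from datetime import datetime

customer_history = [
    {"customer_id": "c1", "segment": "Bronze", "loaded_at": datetime(2024, 1, 1)},
    {"customer_id": "c1", "segment": "Gold",   "loaded_at": datetime(2024, 6, 1)},
]

def version_as_at(history, key_col, key, point_in_time):
    """Latest version of a concept loaded on or before the event time."""
    candidates = [v for v in history
                  if v[key_col] == key and point_in_time >= v["loaded_at"]]
    return max(candidates, key=lambda v: v["loaded_at"]) if candidates else {}

event = {"customer_id": "c1", "product_id": "p9", "ordered_at": datetime(2024, 3, 15)}
hydrated = {**event,
            **version_as_at(customer_history, "customer_id",
                            event["customer_id"], event["ordered_at"])}
print(hydrated["segment"])  # Bronze - the version current when the order happened
</code></pre><p>Doing this for every concept attached to an event, and for the business rules in force at that time, is what makes the fully virtual version of that graph hard to reason about.</p>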
<p>And then when performance, those bottlenecks, are potentially resolved, or the data sets are manageable enough, the ideal that I have to mitigate those kinds of concerns that you mentioned is almost the same as how API versions work, in my view.</p><p>So you have this data solution that acts as a surface for data consumers to extract their information, and they use that data as per a specific version of availability, like a release, right? You've got the model, like: this data follows these rules, et cetera. So you get the data, and then if you change something, anything in your design, you could theoretically spin up a second version, a version 1.1, next to the original version. Give people some time to cut over and to understand what's happening. And all those lineages and everything will still be captured in your version control history, and visible. So that's how I look at it. And again, this is the ideal world, but I do think thinking about your solution in those terms helps to make it as robust as it can be.</p><p><strong>Shane</strong>: And I think technology's moved on, yeah, 'cause we use Google BigQuery under the covers and we can virtualize all the way through.</p><p><strong>Roelant</strong>: It's always those constraints that you have to work with.</p><p><strong>Shane</strong>: Yeah. I remember the old days of Oracle servers, and we used to have to get them ordered in Singapore and shipped over and more memory put in.</p><p>And then we got really expensive Teradata boxes that could handle the volume but cost a fortune. The guys from MotherDuck talk about how it's not a big data problem for many people, and I think that's true. So I think we're a lot closer to the technology being able to support virtualization; I just don't think as data professionals we are. And this idea of having reusable patterns, that then have solution patterns to support them, that then have proven code, gives us one step towards being safer, being able to virtualize all the way through. And especially with things like zero-copy cloning, where we actually don't have to take a full copy of the data back to test things out. So maybe the first step is, whenever you're doing any development, do it in an end-to-end virtualized pattern to test that it works, and then physicalize when you have a cost or performance constraint.</p><p><strong>Roelant</strong>: That's the final chapter of the book. So the engine concept is what we start the book with, and it explains all these sort of building blocks of what constructs a fully automated machine. And one of the components is the optimizer. And we say exactly that, right?</p><p>So start virtual, but depending on which optimization directive you want to give it, like cost, storage, IO, latency, whatever it is, it can calculate, based on usage patterns, what the best combination of physical and virtual objects is to deliver that, right? So if you don't want cost, because you're running on Snowflake, for example, or something like that, it can virtualize or materialize some parts because that helps you. If your directive is latency, maybe it just doesn't do that. The end result is everything working the best way for what you've asked for.</p><p><strong>Shane</strong>: And then you're gonna have to bring in the segmentation, domain based or business unit based or process based, so you can say these data flows need to be low latency.</p><p><strong>Roelant</strong>: Exactly. How cool is that, right? I want this mart to be refreshed every minute and this one every hour.</p>
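<p>A toy sketch of those per-flow directives, with invented names and a deliberately naive rule (this is not the book's engine): each flow declares what it is optimized for and how fresh it needs to be, and the planner decides whether it stays virtual or gets materialized.</p><pre><code># Toy optimization directives per data flow; names and the planning rule are invented.
from dataclasses import dataclass

@dataclass
class FlowDirective:
    flow: str
    optimize_for: str   # "cost" or "latency"
    refresh: str        # e.g. "1m", "1h", "1d"

def plan(directive):
    """Naive planner: latency-sensitive or minute-level flows get materialized."""
    materialize = directive.optimize_for == "latency" or directive.refresh.endswith("m")
    return {
        "flow": directive.flow,
        "strategy": "materialized table" if materialize else "virtual view",
        "refresh": directive.refresh,
    }

for d in [FlowDirective("finance_mart", "latency", "1m"),
          FlowDirective("hr_mart", "cost", "1d")]:
    print(plan(d))
</code></pre><p>In practice the planner would weigh usage patterns, cost and storage as Roelant describes; the point is that the directive, rather than hand-written plumbing, drives the physical choice.</p>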
<p>Right. And this is literally, as we say in the book, that becomes a requirement, because as a data team you have limited resources no matter how you look at it. If you want to optimize that, it becomes more of a management decision, and then you get more resources to deliver that requirement, and you just update the directives and go.</p><p><strong>Shane</strong>: So if I took that future state, where I virtualize everything and it actually worked, I could also see a future state where you basically pull up the design patterns, you tick some boxes for them, pull up the constraints around which technologies it's allowed to use, and it actually writes the solution patterns and the code and deploys itself as a fully config-driven end-to-end platform from a couple of check boxes. There'd be a lot of check boxes, though, a lot of complexity.</p><p><strong>Roelant</strong>: In our definition, the design patterns would remain the same, but the different solution patterns will be selected by the engine to deliver the updated physical implementation.</p><p><strong>Shane</strong>: So are you saying that the design patterns become mandatory, in your opinion?</p><p><strong>Roelant</strong>: Yeah. Yeah, because that defines how your solution operates.</p><p>The historization, how to do reference data, all those things we've discussed earlier. So you need to make a decision on how you wanna treat all those big blocks. But for each of the design patterns you've got a number of solution patterns, for different technologies or concepts or anything. So you can switch between those and then let the solution refactor.</p><p><strong>Shane</strong>: So let's explore that a little bit, because again, it's just bringing in this idea of agility, which is the ability to change at low cost and with speed, not Scrum. And so there's this balance about work done upfront. So there's times where you do work up front that you don't need right now, 'cause you know it's gonna save you time in the future, but it is waste 'cause you're doing it before you need it.</p><p>And in fact, there's a really interesting conversation with Patrick Lagger at the moment on LinkedIn where he's saying he actually doesn't do any work up front anymore. If he doesn't need to build it, he doesn't build it. So I have to have a whole chat with him now about: hold on, doesn't that just become a whole bunch of ad hoc? Now, he's got enough experience to probably know how to retrofit the patterns after the fact. Yeah. Okay. But let's not go off on that tangent. So let's do a couple of examples. So I would say you always want to bring a historization pattern into your platform on day one, especially if you can just pull it out of the pattern library, because you are gonna need it. I could then say, if I was looking for feature flags for analytics, for machine learning, so I wanted to be able to create a very wide, very long table that was keyed by customer, that had a whole lot of feature flags, their age bucket, their usage bucket, their income bucket, that I was gonna put into a machine learning model to do some clustering segmentation,</p><p>I would probably not do that work until I had an information product that needed that work to be done. That would be my standard approach. Now, what I think you are saying is, if that is a design pattern, and it is, 
and if the solution patterns and the implementation code were readily available and proven, that box would be ticked by default, because at some stage I would need it.</p><p>And the cost of implementing it is almost nothing. Is that what you're saying? Or did I misunderstand that?</p><p><strong>Roelant</strong>: Partly. So in this particular example, I would argue that every delivery of data uses the same pattern. So no matter how you deliver, like everything normalized or one big flat table, that's all the same pattern you use.</p><p>You just select different columns. But conceptually, yes. So the way I look at it, you would have a couple of solution patterns readily available, because, you know, we provide them with the book and the GitHub and all that stuff, but they might not fit your technology and all that kind of stuff. So you would then add them as you need them? I would think so. Having a library of patterns that we can all collaborate on, that is somewhat centrally available, where we can work together as a community to improve them, that's a bit of an ideal-world dream. But in smaller groups that would certainly be possible. But I think that if your solution pattern doesn't exist for the specific technology that you're working with, yeah, you would have to add it.</p><p>And in that case, I would only add it when I need to add it.</p><p><strong>Shane</strong>: But that idea of a pattern library we can all share and pick from, that happens in the software domain. Our software engineering friends have that already. They've done the work to make that happen and got the value out of it. In the data domain, we don't, right? We go and implement a new piece of technology or a new vendor, and we go, oh, SQL Server, I'm gonna do Snowflake now, and we reinvent all the patterns again. And it's, oh, we're gonna go Databricks, oh, let me reinvent all the patterns again from scratch, because I'm changing the technology stack. And that's just crazy.</p><p><strong>Roelant</strong>: That's just waste. Yeah. So much to learn from our friends in, uh, software development.</p><p><strong>Shane</strong>: Yeah. And I still see large platform builds. I still see projects where it's six months to a year before the platform's available to the team to actually build something of value. That has changed a little bit, but not a lot in my view.</p><p><strong>Roelant</strong>: Not a lot, no. A library like that, to use and contribute to, hopefully it'll help get things moving in that direction.</p><p>We've talked about this before, right? And those plans, they're still there.</p><p><strong>Shane</strong>: I gotta say, at least you've been following this dream for a while. So I can see that, actually, to write this book there are at least three big challenges that just come to mind. So one is building out this mental framework of design patterns and what they are, solution patterns and code that implement those solutions, and how you describe them, how you link them, the kind of meta model for that model.</p><p>That's the first problem. The second one is then taking your 25 years of experience, and Dirk's been in the industry about as long. Yep. 
So taking 25 years of experience from two people and then distilling that down into those patterns, like just picking them out of your brain and writing them down, so that you get this library of all the patterns that you've ever implemented and that are valuable.</p><p>And then the third one is figuring out how the hell you tell that complex story in a book.</p><p><strong>Roelant</strong>: A hundred percent. And it's taken us almost seven years, right? So that's something. And I do find, as you point out, Data Engine Thinking is co-written by Dirk and myself, and we both worked very, very hard on it.</p><p>And the interesting thing is that there has never been a case where we didn't agree on something, right? So it's like this convergent evolution kind of thing. You work towards something that, you know, people get to, and not just us but other people as well. You slowly move towards this approach that makes sense at present.</p><p>So it's been very... harmonious is probably the right word to use. We never had any sort of disagreement on how things should be done. There were a couple of cases where we didn't know something yet, right? So a couple of errors that we ran into, where we'd say, hey, actually I never thought about that, or, I wonder what would happen if that happens.</p><p>And then we'd sort of explore that and start coding and write a couple of examples. And then certain cases where timelines or time periods, start date and end date, that kind of thing, were missing, or not in an expected pattern or something like that. Or what the consequences are if the source or operational system doesn't have that time validity, and so to really explain what happens then. Oh yeah, that's interesting. But never that we weren't on the same page. So I think that's pretty special.</p><p><strong>Shane</strong>: And I also think the fact that you are creating code and applying it against data scenarios to prove that both the solution pattern and the design pattern have legs, right, are real things that you can describe, actually probably helped in terms of validating the book as you go. Here's the design pattern, here's what we think the solution pattern is, let's create some code, and if it survives and does what the pattern says it should do, then we know it's a real thing. Absolutely.</p><p>'Cause there seem to be a few books that have come out over the last couple of years where I read them and I go, that's great theory, but where's the meat and potatoes? Right, how do I do that? Because that seems great, that seems like Nirvana, and we've been trying to get to that for the 35 years I've been in data, but we've never made it.</p><p>So just tell me how to do it for once. Yep. Yeah. And this book seems to be taking us to that state, right, where there is an example. So when is it coming out?</p><p><strong>Roelant</strong>: End of June, I reckon. Yep. We've got the Global Data Summit coming up in Iceland to present the book and everything, so that's when the physical copy should be available, and everything should be orderable online as well. So that's gonna happen. Yeah. And the events.</p><p><strong>Shane</strong>: Yeah, I was hoping to be at Iceland this year, but for a lot of personal reasons it's not gonna happen. 
And then I was hoping to be at the Helsinki Data Week, and then they moved it to when I wasn't gonna be in the UK, so I missed that one as well. I'm happy that 2025 seems to be the year of the books.</p><p>There's a hell of a lot of really good data books coming out, so I'm hoping 2026 will be the year of good conferences for me. Yeah, see you there. One interesting question Joe Reis asked on his podcast just lately, and I'm gonna go steal it, thanks Joe, is this idea of: how did you write the book? What was the process you used to write this book? Can you just talk me through how the two of you wrote it together?</p><p><strong>Roelant</strong>: I'm not sure if we actually followed a process. We've both done classroom training for a long time, and our focus areas are slightly different, but it's similar stuff. And the idea of having a storyline that supports the more quote-unquote theoretical materials, the theory parts, we agreed on that really early, back in the day.</p><p>We thought, ah, maybe we don't wanna make it like a textbook or anything like that, right? There are some components, but I started to write the main theoretical chapters and Dirk started to write the main storyline chapters, and then we would peer review each other's work and revise it. Then we'd swap roles a little bit, saying, now I'm gonna update that storyline with the theoretical explanation that I have in mind, and tweak it a little bit, and then he does the same and reviews and changes some of the explanations or, you know, adds what he wants to say. So it's like that, right? We both write a chapter, storyline versus theory, and then we switch roles and then we do it again.</p><p><strong>Shane</strong>: So effectively you storyboarded the chapter and then wrote in isolation, and then swapped to iterate and align. It would be interesting to see which chapters had close alignment after the first iteration and which chapters deviated from each other more than the others.</p><p><strong>Roelant</strong>: Yeah. Yeah. I think the biggest work we've done is on the definition of what the architecture would be, the different options. So the big boxes that we've talked about earlier, that took us the most time to really nail. And the other one is the different ways of looking at temporality and data. Those are the real big things.</p><p><strong>Shane</strong>: Yeah. Seems simple, right? Historization. It really does seem simple until, again, you engage with the first bit of data. And one of the things I found, and it's only after I've done the retro of publishing my book that I realized how valuable it was, is actually creating a course for the content in the book before you write the book. And then actually training people was something I hadn't realized how valuable it was, because by creating the content you are actually drafting the book. 
You're just not writing the words; you are drafting the images, and you talk it, and then you write the hands-on exercises that people do to reinforce it. And then,</p><p>as you are writing the book, you're flicking back between iterating the course and the book, because you're starting to learn a different way of articulating it that seems better, or you're getting feedback on the course that you need to bring in. And so that gives you a safe kind of testing ground to actually learn and do at the same time. Is that what you found?</p><p><strong>Roelant</strong>: A hundred percent. Yep. So at some point I started recording the trainings for that purpose. And then after the training sessions, I was updating the notes in the slide decks to find the right words. And that all made it into the book. And at the end, I just opened the training again and started removing slides from the deck if I was happy enough that it was officially covered in the book, until there were none left.</p><p><strong>Shane</strong>: Oh, I hadn't thought about doing that actually. Yeah. Picking up the course material and deleting it when you know the book's covered it, to prove you've got the coverage and aren't slipping stuff in. Yeah. What I actually found was I ended up writing more stuff in the book that I haven't put back in the course yet. Yes. Because then I've gotta extend the course, make it longer, and I'm like, I'm not sure I want to make it any longer. Exactly. Yeah. So that's a trade-off decision. Yeah. Alright. If people want to get hold of you, what's the best way for them to find you?</p><p><strong>Roelant</strong>: Yeah, find me on roelantvos.com. That's my website with the blog that I have ignored slightly because I was so busy writing the book, but that's where you can find me.</p><p>And also dataenginethinking.com, which is the book website, and it has some examples and some preview chapters and some of the personas. So definitely find us there.</p><p><strong>Shane</strong>: Excellent. And in June it'll be out on Amazon, no doubt.</p><p><strong>Roelant</strong>: For sure. Yeah. And we're also exploring some other platforms, including our own website, but yeah, definitely Amazon to start with.</p><p><strong>Shane</strong>: Excellent. Alright. I look forward to buying that one, and finally getting a design pattern template that I can use repeatedly and stop hating the ones that I use currently. So thank you for that. At least that's gonna be the best value that I've had for a long time.</p><p><strong>Roelant</strong>: I genuinely hope so.</p><p><strong>Shane</strong>: Excellent. 
Hey look, thanks for coming on the podcast and I hope everybody has a magical day.</p><h2>&#171;oo&#187;</h2><div class="pullquote"><p><em>Stakeholder - &#8220;Thats not what I wanted!&#8221; <br>Data Team - &#8220;But thats what you asked for!&#8221;</em></p></div><p>Struggling to gather data requirements and constantly hearing the conversation above?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Bu2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" width="387" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:387,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19725,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/160520537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Want to learn how to capture data and information requirements in a repeatable way so stakeholders love them and data teams can build from them, by using the Information Product Canvas.</p><p>Have I got the book for you!</p><p>Start your journey to a new Agile Data Way of Working.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://adiwow.com/168&quot;,&quot;text&quot;:&quot;Buy the Agile Data Guide now!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://adiwow.com/168"><span>Buy the Agile Data Guide now!</span></a></p><h2>&#171;oo&#187;</h2>]]></content:encoded></item><item><title><![CDATA[Data Engineering Patterns with Chris Gambill]]></title><description><![CDATA[AgileData Podcast #65]]></description><link>https://agiledata.info/p/data-engineering-patterns-with-chris</link><guid isPermaLink="false">https://agiledata.info/p/data-engineering-patterns-with-chris</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Fri, 13 Jun 2025 05:09:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ab32469d-18db-4510-ba3c-301844b4aed3_800x800.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Chris Gambill about a number of Data Engineering patterns.</p><blockquote><p><strong><a href="https://agiledata.substack.com/i/165842333/listen">Listen</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/165842333/google-notebooklm-mindmap">View MindMap</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/165842333/google-notebooklm-briefing">Read AI Summary</a></strong></p><p><strong><a href="https://agiledata.substack.com/i/165842333/transcript">Read Transcript</a></strong></p></blockquote><h2>Listen</h2><p>Listen on all good podcast hosts 
or over at:</p><p><a href="https://podcast.agiledata.io/e/data-engineering-patterns-with-chris-gambill-episode-65/">https://podcast.agiledata.io/e/data-engineering-patterns-with-chris-gambill-episode-65/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://podcast.agiledata.io/e/data-engineering-patterns-with-chris-gambill-episode-65/&quot;,&quot;text&quot;:&quot;Listen to AgileData Podcast Episode&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://podcast.agiledata.io/e/data-engineering-patterns-with-chris-gambill-episode-65/"><span>Listen to AgileData Podcast Episode</span></a></p><blockquote><p><strong>Subscribe:</strong> <a href="https://podcasts.apple.com/nz/podcast/agiledata/id1456820781">Apple Podcast</a> | <a href="https://open.spotify.com/show/4wiQWj055HchKMxmYSKRIj">Spotify</a> | <a href="https://www.google.com/podcasts?feed=aHR0cHM6Ly9wb2RjYXN0LmFnaWxlZGF0YS5pby9mZWVkLnhtbA%3D%3D">Google Podcast </a>| <a href="https://music.amazon.com/podcasts/add0fc3f-ee5c-4227-bd28-35144d1bd9a6">Amazon Audible</a> | <a href="https://tunein.com/podcasts/Technology-Podcasts/AgileBI-p1214546/">TuneIn</a> | <a href="https://iheart.com/podcast/96630976">iHeartRadio</a> | <a href="https://player.fm/series/3347067">PlayerFM</a> | <a href="https://www.listennotes.com/podcasts/agiledata-agiledata-8ADKjli_fGx/">Listen Notes</a> | <a href="https://www.podchaser.com/podcasts/agiledata-822089">Podchaser</a> | <a href="https://www.deezer.com/en/show/5294327">Deezer</a> | <a href="https://podcastaddict.com/podcast/agiledata/4554760">Podcast Addict</a> |</p></blockquote><div id="youtube2-8yve0g0Eyhs" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;8yve0g0Eyhs&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/8yve0g0Eyhs?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2></h2><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? 
The Information Product Canvas helps you get clarity in 30 minutes or less?</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><h2>Google NotebookLM Mindmap </h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ndSR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb30bd-7257-4f60-9245-eaf8d8ffcb55_7015x11137.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ndSR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb30bd-7257-4f60-9245-eaf8d8ffcb55_7015x11137.png 424w, https://substackcdn.com/image/fetch/$s_!ndSR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb30bd-7257-4f60-9245-eaf8d8ffcb55_7015x11137.png 848w, https://substackcdn.com/image/fetch/$s_!ndSR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb30bd-7257-4f60-9245-eaf8d8ffcb55_7015x11137.png 1272w, https://substackcdn.com/image/fetch/$s_!ndSR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb30bd-7257-4f60-9245-eaf8d8ffcb55_7015x11137.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ndSR!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb30bd-7257-4f60-9245-eaf8d8ffcb55_7015x11137.png" width="1200" height="1905.4945054945056" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33cb30bd-7257-4f60-9245-eaf8d8ffcb55_7015x11137.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:2312,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:3577663,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/165842333?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb30bd-7257-4f60-9245-eaf8d8ffcb55_7015x11137.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ndSR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb30bd-7257-4f60-9245-eaf8d8ffcb55_7015x11137.png 424w, https://substackcdn.com/image/fetch/$s_!ndSR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb30bd-7257-4f60-9245-eaf8d8ffcb55_7015x11137.png 848w, 
https://substackcdn.com/image/fetch/$s_!ndSR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb30bd-7257-4f60-9245-eaf8d8ffcb55_7015x11137.png 1272w, https://substackcdn.com/image/fetch/$s_!ndSR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb30bd-7257-4f60-9245-eaf8d8ffcb55_7015x11137.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>Google NoteBookLM Briefing</h2><h3>Briefing: Data Engineering Patterns Explained</h3><p><strong>Source:</strong> Excerpts from "Data Engineering Patterns Explained" - Agile Data Podcast with Chris Gambill</p><p><strong>Date:</strong> [Implicit - Recent, given discussion of Fabric and current tech trends]</p><p><strong>Prepared For:</strong> Anyone interested in modern data engineering practices, particularly those seeking to understand repeatable solutions to common problems in data architecture and operations.</p><h3>Executive Summary</h3><p>This podcast delves into the concept of "data engineering patterns" &#8211; repeatable solutions for common data-related problems, often fitting specific contexts. Chris Gambill, a data engineering veteran with 25 years of experience, shares his insights on various patterns he's encountered, particularly within the Microsoft and AWS ecosystems. A key takeaway is the dynamic nature of these patterns, necessitating continuous review due to the rapid evolution of data technologies. The discussion also highlights the importance of documentation, context-awareness, and the emerging role of AI in leveraging and sharing these patterns.</p><h3>Key Themes and Most Important Ideas/Facts</h3><p><strong>Defining Data Engineering Patterns:</strong></p><ul><li><p>Patterns are conceptualised as "solutions for common problems which fit a certain context." 
This analogy is drawn from architectural patterns in building design, emphasizing their repeatability and suitability to specific scenarios.</p></li><li><p>Examples across various domains illustrate the concept: "the way people submit code to get," "the way people peer program," "the five ceremonies of Scrum," or "a four tier data architecture." Each serves as a "solution to a common problem."</p></li><li><p>Crucially, a pattern's suitability is context-dependent: "given your context, it may fit, it may be valuable, or it may actually be an anti-pattern."</p></li></ul><p><strong>Core Data Engineering Patterns and Their Nuances:</strong></p><ul><li><p><strong>Python Script ETL/ELT with Docker &amp; AWS Fargate (Batch Processing):</strong></p><ul><li><p><strong>Pattern:</strong> Writing a Python script for extract, transform, and load (ETL/ELT), containerising it with Docker, deploying it to AWS Elastic Container Repository (ECR), and orchestrating its execution via AWS Fargate and cron schedules.</p></li><li><p><strong>Context/Use Case:</strong> Ideal for batch scheduling, "maybe once or twice a day, bigger processes."</p></li><li><p><strong>Anti-Pattern:</strong> Not suitable for high-frequency loads (e.g., "every 15 minutes" or "five minute load") or real-time streaming, where "you need to probably go the Kafka route or like I said, Lambdas." Fargate costs can be high for frequent runs.</p></li><li><p><strong>Deployment:</strong> Typically "spin it up and then you're killing it at the end," adopting a "deploy and destroy" serverless pattern.</p><ul><li><p><strong>AWS Components:ECR (Elastic Container Registry):</strong> Stores containers.</p></li><li><p><strong>ECS (Elastic Container Service):</strong> Manages scheduling and resource allocation (e.g., EC2 or Fargate clusters).</p></li></ul></li><li><p><strong>Fargate:</strong> Serverless compute for running containers.</p></li></ul></li><li><p><strong>Landing Data in File-Based Storage (Data Lake Pattern):</strong></p><ul><li><p><strong>Pattern:</strong> Landing raw data into file-based storage like Google Cloud Storage (GCS), AWS S3, or Azure Data Lake Storage (ADLS Gen2/OneLake).</p></li><li><p><strong>Benefit:</strong> Enables cost-effective loading into downstream systems (e.g., "if we load from Google Cloud storage into BigQuery, it's free. We don't pay any compute").</p></li><li><p><strong>Exceptions:</strong> Out-of-the-box adapters (e.g., GA4 to BigQuery) may bypass this layer for convenience.</p></li></ul></li><li><p><strong>Data Adapters/Connectors:</strong></p><ul><li><p><strong>Challenge:</strong> Organizations often have a mix of well-known systems with off-the-shelf adapters (e.g., HubSpot, Salesforce) and custom or niche systems requiring bespoke adapters ("bring your own adapter" - BYOA).</p></li><li><p><strong>Considerations:</strong> Customisations to off-the-shelf packages can render commercial adapters unusable, forcing custom builds. Security and governance requirements (e.g., in cybersecurity, government domains) might favour custom Python scripts over "low-code/no-code" tools due to greater control.</p></li></ul></li><li><p><strong>Databricks Serverless Library Management (Bootstrap Notebooks):</strong></p><ul><li><p><strong>Problem:</strong> Databricks serverless compute, being ephemeral, doesn't support traditional initialisation scripts for pre-loading libraries. 
Every new orchestration effectively starts with an empty container.</p></li><li><p><strong>Pattern (Bootstrap Notebook):</strong> A "bootstrap notebook" is used to consistently install required Python libraries. This notebook determines if a "wheel" (a local, static file of the library) is available in a data lake (ADLS, S3) for faster, version-controlled installation. If not, it falls back to PyPI.</p></li><li><p><strong>Benefits:</strong> "one central location for maintainability and for consistency across your full environment so that when you're troubleshooting, you have one place to go."</p></li><li><p><strong>"Wheel":</strong> A pre-compiled or static package for Python libraries, offering consistency and faster installation compared to downloading from PyPI every time. It allows for version control, ensuring "a consistent version of that library."</p></li></ul></li><li><p><strong>Azure Data Factory (ADF) Orchestration:</strong></p><ul><li><p><strong>Pattern:</strong> Using ADF pipelines to orchestrate various tasks, including running Databricks notebooks and other ADF pipelines. Includes built-in success/failure handling, alerting (e.g., Teams messages), and metadata tracking (logging run stats, records loaded, run IDs).</p><ul><li><p><strong>Comparison to Airflow:ADF:</strong> Can orchestrate within the Azure ecosystem, but may require creative solutions for external systems without native connectors. People often "schedule and forget" ADF pipelines, underutilising its orchestration capabilities.</p></li></ul></li><li><p><strong>Airflow:</strong> Often seen as the go-to orchestration tool, especially for hybrid cloud environments or orchestrating disparate systems due to its robust connectivity.</p></li><li><p><strong>Databricks Native Orchestration:</strong> While not a "native orchestrator" like Airflow or ADF, custom Python notebooks within Databricks can orchestrate child notebooks, leveraging Spark's concurrency for sequential or parallel execution. These can be "table-driven" for dynamic orchestration.</p></li></ul></li></ul><p><strong>Cross-Cutting Principles and Challenges:</strong></p><ul><li><p><strong>Context is King:</strong> The suitability of any pattern is entirely dependent on the specific organizational, technical, and business context. "If you think about it again, that context is key." This includes factors like data residency, security requirements, existing technology stacks, and skill availability.</p></li><li><p><strong>Patterns Within Patterns:</strong> Data engineering solutions are often composed of multiple interlocking patterns. An orchestration pattern, for instance, encompasses sub-patterns for alerting, logging, and dynamic configuration.</p></li><li><p><strong>The Technical Deployment May Change:</strong> Even if the core conceptual pattern remains, its technical implementation will vary significantly across different technologies (e.g., Databricks vs. Azure Synapse). "The technical deployment of a patent may change depending on which technology you use."</p></li><li><p><strong>Rapid Technological Evolution:</strong> The data platform landscape changes at an incredible pace. Patterns need "iterated over time" and reviewed frequently (Chris aims for "at least once every 12 months, if not every six months") to ensure efficiency and relevance. 
Microsoft Fabric is cited as a prime example of rapid evolution from a "marketing architecture" to a robust platform.</p><ul><li><p><strong>Vendor Patterns/Strategies:Microsoft (Fabric):</strong> Focus on unified control planes, a "one lake" data strategy, and integrated UI. Initially priced to attract large customers as early adopters, it's now more accessible and robust. Microsoft is "coercing people towards fabric in many different ways" (e.g., deprecating Power BI licenses, changing certifications).</p></li><li><p><strong>Databricks/Snowflake:</strong> A tendency to "announce features that don't exist now" at summits, followed by phased rollouts (internal, trusted users, early adopters, generally available) over a ~12-month cycle.</p></li><li><p><strong>Snowflake's "End-to-End" Strategy:</strong> Increasingly integrating third-party functionalities directly into their platform, putting pressure on customers to "decommission all the third party products."</p></li></ul></li><li><p><strong>DBT (Fusion):</strong> As an open-source project transitioning to a commercial entity with VC funding, it's a "common pattern" for more features to move behind a paywall to drive growth and profitability.</p><ul><li><p><strong>Dynamic/Table-Driven Architecture (Mega Pattern - "Context Layer"):Pattern:</strong> Storing configuration, business logic, dependencies, and attributes in relational tables to dynamically generate and orchestrate data processes, rather than hardcoding them.</p></li><li><p><strong>Benefits:</strong> Reduces "overhead," improves maintainability, especially for inherited codebases. "It's so much more maintainable than hard coding it."</p></li></ul></li><li><p><strong>Analogy:</strong> Think of it as a "context layer" that drives orchestration and ETL processes, making them adaptable to change without extensive code modification.</p></li><li><p><strong>The Overwhelming Landscape for Newcomers:</strong> The sheer volume of tools, platforms, languages (Python, Java, SQL), and undocumented patterns makes data engineering incredibly challenging for junior professionals. "You don't know what you don't know."</p><ul><li><p><strong>The Challenge of Documentation &amp; Sharing:Importance:</strong> Crucial to avoid "tribal knowledge" where patterns are lost if key personnel leave. "So important I think, to document these things and write them down."</p></li><li><p><strong>Reality:</strong> Documentation is often an "overhead" for consultants paid hourly, making it less incentivised. Even documented patterns can be hard to find and reuse, leading engineers to "just write from scratch" based on memory.</p></li></ul></li><li><p><strong>AI's Potential Role:</strong> LLMs could potentially act as repositories for documented patterns, providing step-by-step guidance and even generating code based on project descriptions, helping less experienced individuals navigate the complexity.</p></li></ul><h3>Future Outlook</h3><p>The rapid evolution of data technologies and vendor strategies means that data engineering patterns will continue to adapt. The increasing consolidation of platforms (like Microsoft Fabric and Snowflake's expansion) reflects a desire for integrated, end-to-end solutions. The rise of AI/LLMs holds promise for democratising access to, and leveraging of, documented patterns, potentially mitigating the "overwhelming" nature of the field for newcomers and fostering better knowledge sharing. 
However, the onus remains on experienced professionals to document and refine these patterns.</p><p></p><div class="pullquote"><p><strong>Tired of vague data requests and endless requirement meetings? The Information Product Canvas helps you get clarity in 30 minutes or less.</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledataguides.com/ipc&quot;,&quot;text&quot;:&quot;Fix Your Data Requirements&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://agiledataguides.com/ipc"><span>Fix Your Data Requirements</span></a></p></div><p></p><h2>Transcript</h2><p><strong>Shane</strong>: Welcome to the Agile Data Podcast. I&#8217;m Shane Gibson.</p><p><strong>Chris</strong>: I&#8217;m Chris Gambill.</p><p><strong>Shane</strong>: Hey, Chris. Thanks for coming on the show. Today we&#8217;re going to talk about data engineering patterns. Before we rip into that, why don&#8217;t you give a bit of a background about yourself to the audience?</p><p><strong>Chris</strong>: Absolutely. So I have spent about the last 25 years in the data world.</p><p>I actually got my start in a call center back in 2000, where I quickly became the kind of go-to person for untangling all these Excel spreadsheet nightmares and all these manual processes. And they figured out that I had this knack for automation and data, and I quickly got pushed into a position where that was what I did all the time, and since then it&#8217;s really taken off from there.</p><p>I&#8217;ve worn a lot of different hats, from individual contributor all the way through director positions. I was brought up in the Microsoft stack, so a lot of DTS packages early on, then SSIS, moving into Azure and ADF and Synapse and Fabric now, and then I even had opportunities to touch Snowflake and Databricks, and even do some migrations from AWS to Azure and back again.</p><p>A lot of breadth of knowledge, as well as really deep knowledge of the Microsoft stack. Most of my knowledge has been self-taught and really tested on real world projects, right? And then today I run Gambill Data, where I help organizations design scalable modern data solutions, automate reporting, and really bring a lot of order to the data chaos that you see out there.</p><p>And I&#8217;m also recently proud to say that I am now a Microsoft partner as well. Super excited about that. And that really lets me stay ahead of the new technologies, &#8217;cause I get some advanced views of different things that are coming out from Microsoft.</p><p><strong>Shane</strong>: Excellent. Hey, at some stage you might take a sidetrack into whether Fabric is more than just PowerPoint slides with Synapse renamed.</p><p>But before we do that and we get into the sarcasm from me, why don&#8217;t we jump into the core of the subject? I suppose if I put some anchoring in place, the way I think about patterns, and I get it out of the pattern book that came out about buildings, is as solutions for common problems which fit a certain context.</p><p>So I can look at the way people submit code to Git and say, Hey, that&#8217;s a repeatable pattern. I can look at the way people peer program and say, Hey, that&#8217;s a repeatable pattern. I can look at the five ceremonies of Scrum and say, Hey, that&#8217;s a repeatable pattern. 
I can look at a four tier data architecture and say that&#8217;s a repeatable pattern, because each one of them is a solution to a common problem.</p><p>And given your context, it may fit, it may be valuable, or it may actually be an anti-pattern. So let&#8217;s use that as the anchoring. What&#8217;s one pattern that kind of comes straight to mind in the work that you do?</p><p><strong>Chris</strong>: Yeah, so really, when I got a chance to work with AWS, this was one of my favorite patterns out there, partially because I think that this pattern gives you a lot of control as to what&#8217;s going on with your whole ETL/ELT process.</p><p>And that pattern is to write a Python script to do all of your extracting, your transformations and your loading. Take that, containerize it using Docker, create that environment in Docker and really containerize it. And AWS has some wonderful elastic resources where you could load it into the Elastic Container Registry in AWS.</p><p>And then from there, in this example, I generally would use Fargate, and I&#8217;d run them in the Elastic Container Service on Fargate. And then we orchestrated that with cron schedules, &#8217;cause most of that was batch scheduling. And that lends itself well to batch scheduling that you&#8217;re gonna do maybe once or twice a day, bigger processes, because if you&#8217;re doing anything more often, right, like a 15 minute load every 15 minutes, or if you&#8217;re doing any type of streaming, you probably need to go the Lambda route. But with Fargate and this particular pattern, it lends itself to that. Otherwise, for streaming, you go in a little bit different direction, because cost can be high with Fargate.</p><p>But Fargate does really well with this specific pattern.</p><p><strong>Shane</strong>: And the thing that I like about the way you described that is I can think of them as Lego blocks. So there&#8217;s a bunch of Lego blocks and you&#8217;ve racked and stacked them together, and then you&#8217;ve described the anti-pattern. So if you are doing batching, if 15 minutes or more is the frequency of loads, then this is gonna work.</p><p>But if you go into anything below 15 minutes, definitely near real time, but even if you went down to a five minute load, you&#8217;re gonna have a whole lot of things you need to deal with. And this may not be the best pattern for that. It would be a more streaming centric type of pattern. Is that how you think about it?</p><p><strong>Chris</strong>: Absolutely. Absolutely. Yeah. &#8217;cause then you need to probably go the Kafka route or, like I said, Lambdas,</p><p><strong>Shane</strong>: and go that direction. And then often there are these patterns within patterns. So for example, when you talk about this extract and load pattern you&#8217;ve got, you talk about how effectively you&#8217;ve containerized it.</p><p>Then I&#8217;d be looking and saying, okay, are those containers running 24 by 7? Or have you adopted a form of a serverless pattern? Are you effectively deploying the container, running it, destroying it, deploying it, running it, destroying it? So there&#8217;s always patterns within patterns, which makes it a little bit trickier.</p><p>But what&#8217;s your common one for this? Do you tend to run that container permanently, or do you adopt a deploy and destroy kind of pattern?</p><p><strong>Chris</strong>: In these cases with the batch loading, we usually deploy and destroy, right? 
It&#8217;s usually you spin it up and then you&#8217;re killing it at the end, &#8217;cause that Fargate instance dies at the end of the run anyway.</p><p>So it&#8217;s a spin up and then burn down at the end.</p><p><strong>Shane</strong>: And then the other one that&#8217;s interesting for me is that you&#8217;re pushing it to Elastic. So that&#8217;s Elasticsearch, yeah?</p><p><strong>Chris</strong>: Not Elasticsearch. Elasticsearch is a little different. So in AWS, there&#8217;s a couple of resources, ECS and ECR, right? One is the repository, obviously the ECR piece, and that&#8217;s where all your containers get stored.</p><p>And then you have the Elastic Container Service, and that is where all the scheduling happens, where you are defining whether you&#8217;re gonna spin up an EC2 cluster or a Fargate cluster, right? And set all your permissions and your security pieces and all that.</p><p><strong>Shane</strong>: Okay. So that&#8217;s almost the orchestration of how you deploy and destroy those containers. Is that right? That&#8217;s correct. And this is the thing, because I only work in the Google Cloud platform typically these days for our product, I&#8217;m trying to map your patterns, you know, so I&#8217;m going, oh, okay. So effectively we run Cloud Run and Cloud Build. That&#8217;s the equivalent of the containerization. We run Python when we have to.</p><p>Yep, that makes sense. We only use it when it&#8217;s 15 minutes or more, because we have a whole lot of other patterns that we need if we&#8217;re gonna go to anything that actually transforms faster end to end. And then for me it was like, oh, the Elastic thing, right? I just naturally went to Elasticsearch because that&#8217;s the last AWS thing I used that had the word elastic in it.</p><p>Again, I think one of the tricks with patterns is explaining with clarity what they are, and then actually explaining the context. Explaining the anti-patterns is what I tend to find the easiest way to do it: we know if it looks like streaming, this pattern may not work, find another pattern. Otherwise, you&#8217;ve gotta go and actually do a whole lot of detail and describe it in words, and that&#8217;s actually really hard.</p><p>So how do you do that? How do you actually document a pattern that you&#8217;ve used so that you can remember it next time, or so somebody else can use it?</p><p><strong>Chris</strong>: So when I was at AT&amp;T, for example, I would create these patterns and I would actually document all of it a couple of different ways. One, I would write it all out just in a Word doc and save it in a share. We had a SharePoint site, this was 10 years ago, I don&#8217;t know, time gets away from me, so save it in SharePoint. But I would also create PowerPoint presentations with slides of all the different steps, so that people could see, okay, these are the resources that you need for this pattern, this is the description of the pattern.</p><p>Then step by step, mostly because people like the step-by-step feel that you get with the presentation, as opposed to people glazing over when they&#8217;re reading a Word doc.</p><p><strong>Shane</strong>: I suppose that step by step, especially if you put screenshots in, gives them more context. People can just go through and see what you are doing, or follow the steps and learn by doing, and that makes sense.</p>
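<p>As a rough, hedged sketch of the batch pattern Chris describes above: a single Python ETL entrypoint that would be baked into a Docker image, pushed to ECR, and run as a scheduled ECS/Fargate task that exits when the batch finishes. The source URL, bucket, key and field names below are illustrative assumptions, not the actual implementation discussed on the show.</p><pre><code class="language-python">
"""Minimal batch ETL entrypoint (illustrative sketch only).

Pattern: one Python script does extract/transform/load, it is containerised
with Docker, the image is pushed to ECR, and a scheduled Fargate task runs it
once or twice a day. The container exits at the end (deploy and destroy).
"""
import csv
import io
import os

import boto3     # AWS SDK, assumed to be installed in the container image
import requests  # used here for a simple HTTP extract

SOURCE_URL = os.environ.get("SOURCE_URL", "https://api.example.com/orders")  # assumed endpoint
TARGET_BUCKET = os.environ.get("TARGET_BUCKET", "my-landing-bucket")          # assumed bucket


def extract() -> list[dict]:
    # Pull a batch of records from the source system.
    response = requests.get(SOURCE_URL, timeout=60)
    response.raise_for_status()
    return response.json()


def transform(rows: list[dict]) -> str:
    # Keep only completed orders and flatten to CSV for the landing zone.
    completed = [r for r in rows if r.get("status") == "completed"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "status", "amount"])
    writer.writeheader()
    for row in completed:
        writer.writerow({k: row.get(k) for k in ("id", "status", "amount")})
    return buffer.getvalue()


def load(payload: str) -> None:
    # Land the file in S3; downstream jobs pick it up from there.
    boto3.client("s3").put_object(
        Bucket=TARGET_BUCKET,
        Key="landing/orders/orders.csv",
        Body=payload.encode("utf-8"),
    )


if __name__ == "__main__":
    load(transform(extract()))
    # Nothing stays alive: the Fargate task ends here and the container is
    # torn down until the next scheduled run.
</code></pre><p>On a five or fifteen minute cadence, or for true streaming, the same logic would usually move to a Lambda or Kafka style pattern instead, which is the anti-pattern boundary described above.</p>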
<p>And then, if I think about what we do, &#8217;cause we have a similar set of patterns on a completely different technology stack, one of the things we do is we tend to deploy or land all our data in Google Cloud Storage.</p><p>So the equivalent of AWS S3. And what&#8217;s Microsoft&#8217;s this week? Yeah, Microsoft has Azure Data Lake Storage, and then there&#8217;s Gen2, right? And then there&#8217;s OneLake, which seems to be the new Fabric one. The idea is that we&#8217;re actually landing the data into file-based storage. And the reason we do that is because we have a pattern where we load from Google Cloud Storage into BigQuery, and if we load from Google Cloud Storage into BigQuery, it&#8217;s free.</p><p>We don&#8217;t pay any compute. If we push the data directly to BigQuery, we pay for the compute processing for it to actually store that data. Now if we&#8217;re using Google Analytics, so GA4, there&#8217;s an out of the box adapter that goes GA4 into BigQuery, so we bypass our GCS layer. Then if I think about out of the box adapters, this is where I&#8217;m going.</p><p>We&#8217;ll write a Python adapter effectively when it&#8217;s a system that&#8217;s not well served. So what we find is that every organization we work with will have three to four of the standard systems. They&#8217;ll have a HubSpot or a Salesforce, the ones we&#8217;ve seen before, and we can use an off the shelf adapter for that.</p><p>And then they will always have one system we&#8217;ve never seen before. There is no adapter. Only a hundred people in the world use it, and we actually have to write the framework to collect that.</p><p><strong>Chris</strong>: I was gonna say, or even worse, they have a very custom product that they&#8217;ve created in house, where you have to figure out how to create your own adapter.</p><p>It&#8217;s BYOA, bring your own adapter.</p><p><strong>Shane</strong>: Yeah. And of course, it&#8217;s not documented. There&#8217;s no schemas. There&#8217;s sometimes no API. You&#8217;ve gotta go into either the database or the logs. Actually, I don&#8217;t mind that one so much. I find the one that does my nut the most is when they&#8217;re using an off-the-shelf package, something like Salesforce or HubSpot, and they&#8217;ve customized it so much that the off-the-shelf adapters don&#8217;t actually work. So now you&#8217;re actually custom building an adapter, and there&#8217;s a bunch of companies out there that do nothing but build adapters. I&#8217;m thinking Fivetran, Dataddo, those kinds of companies that do nothing but that.</p><p>But let&#8217;s go back to that case. So let&#8217;s say we have a system that you&#8217;re gonna need to collect data from. It&#8217;s a well known one, so you can use one of the commercial software as a service products to do that collection. So effectively, in my head, you are replacing the Python script part of that pattern that you talked about.</p><p>Is that how you think about it? Do you actually plug and play to a degree when you decide that component of that pattern can be replaced with another one? And how do you decide which parts of your pattern survive? So which ones become almost pets and not cattle? &#8217;cause that&#8217;s always a bit of a trick.</p><p><strong>Chris</strong>: It is.</p>
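<p>A small sketch of the landing-zone pattern Shane mentions above: land files in Google Cloud Storage, then use a BigQuery batch load job (which does not bill query compute) to move them into a table. The bucket, dataset and table names are made up for illustration.</p><pre><code class="language-python">
"""Load files already landed in GCS into BigQuery (illustrative sketch only)."""
from google.cloud import bigquery

client = bigquery.Client()  # uses default project credentials

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,              # or CSV / JSON
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# A batch load from Cloud Storage, rather than streaming inserts or querying
# the files directly, is the cheap path Shane describes.
load_job = client.load_table_from_uri(
    "gs://my-landing-bucket/landing/orders/*.parquet",  # assumed landing path
    "my-project.raw.orders",                            # assumed target table
    job_config=job_config,
)
load_job.result()  # block until the load job completes
</code></pre>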
<p>And I think to your point, if it&#8217;s a very standard implementation of something that has a plug and play pattern, a plug and play adapter, where you could just pick up ADF, right, and there&#8217;s an easy connector, it&#8217;s straightforward, then you could swap it out for those low-code/no-code pipeline builds that you could create.</p><p>When I was in cybersecurity, there was some security involved and there was some lack of trust in the low-code, no-code tools. And so Python was the way to go, just because, again, we had a little bit more control when it came to security, and we needed that from a government standpoint too.</p><p>And so those are the types of things that you have to take into account when you&#8217;re looking at which tool you&#8217;re gonna use. Like Informatica does such a great job from a governance standpoint when it comes to that type of stuff, depending on how regulated your domain is.</p><p><strong>Shane</strong>: Yeah, it did, until Salesforce bought it. And that&#8217;s how to derail a data podcast recording in 10 seconds. I come back to this idea that, if you think about a pattern being made up of a bunch of components, each one of those being a Lego block, and you&#8217;re building these Lego towers that have to be fit for purpose, then the context of that end-to-end pattern, but also the context of each component, is really important.</p><p>Because if I dropped into an organization and they already had the extract and load patterns in place, it was the one that you described, and then somebody says, oh, just go hit Salesforce or HubSpot, I&#8217;m naturally gonna go and wanna look at a commercial adapter, because I go, I don&#8217;t wanna write that.</p><p>The frameworks are all out there. They&#8217;re robust, they&#8217;re relatively cheap compared to the human effort of writing one and maintaining it. And so I&#8217;d naturally go that way. But if the context of the organization is that all data must stay within their boundary, and they treat their boundary as a cloud boundary as well, then actually there&#8217;s no point. I can&#8217;t bring that adapter in, &#8217;cause the context of how we&#8217;re choosing these patterns says that&#8217;s actually not the way we can do it. And that makes sense. So context is king, as always. Absolutely. Alright, so that&#8217;s extract and load. Hit me with another.</p><p><strong>Chris</strong>: Okay, so another one is Databricks, right? So just to give you a glimpse of what happens in Databricks for this specific pattern. In Databricks, you could choose between two types of compute. You have dedicated and you have serverless. And with serverless you can&#8217;t just create an initialization script that loads all of the libraries that you need into the compute environment, because it gets spun up fresh every time something happens. You know, that&#8217;s the point of having that serverless, where with the dedicated, you can preload all your libraries into that environment and you&#8217;re good to go unless you have to reboot that environment, but then you have the init script that reruns everything.</p><p>So just an overview as to what&#8217;s going on here. So with the serverless, the pattern that I use because of this is I&#8217;ll create a bootstrap notebook that goes through and says, okay, if these are the libraries that this notebook needs to install, then this is where it goes. Because if you have a wheel in place, it is easier and quicker, and from an efficiency standpoint, a little bit better than having to go out to PyPI to download and install that library. 
So I have a class and a set of functions that goes through and says, okay, this is the library that needs to be installed. It determines whether it is one of the ones that we have a wheel for, and those wheels are generally stored within, in this specific pattern we&#8217;re talking Azure, but like you said with the Lego blocks, it could be easily replaced with AWS S3. But this one&#8217;s ADLS. And so it goes into ADLS, grabs the wheel, does the python -m pip install of that wheel, and then moves on with life. If that wheel&#8217;s not available, then it&#8217;ll fall back to PyPI and download and install from there, but it consistently does the same thing across that whole project&#8217;s environment.</p><p>And it provides you with one central location for maintainability and for consistency across your full environment, so that when you&#8217;re troubleshooting, you have one place to go.</p><p><strong>Shane</strong>: Okay, and so this is where you&#8217;re getting technical, and I probably need my co-founder Nigel, who&#8217;s the technical one of the two of us, but really technical. He&#8217;s the engineering one. But let me explain this one back, because this would be a good test, right, of whether I understand the pattern even though I don&#8217;t understand the technology. So what I think you&#8217;re doing is I think you are templating the initialization of the serverless component. So the serverless component is ephemeral. You use it. It deploys itself. It destroys itself. So unlike a container, where we say go and then die, with serverless we effectively say run, and it&#8217;s doing that for us. Is that right? Yes, exactly. When we deploy a container, the initialization of all the things we need is embedded within that container.</p><p>I&#8217;m old, so VMs were a thing in my day. So we&#8217;re effectively deploying this container, and the initialization and all the things we need are embedded in that container. So we don&#8217;t really have to care. It&#8217;s like container deploy, run, container destroy. And we are in charge of that. That&#8217;s right. That&#8217;s the pattern for containers. Yep. Okay. And then with serverless, effectively they&#8217;re not serverless, &#8217;cause there is a server and there are containers. We just don&#8217;t maintain it. We just go run this thing, and it knows to hit that server, start up that container, run it, kill itself. Yeah. So that&#8217;s the serverless pattern.</p><p>And then what you are saying is, if we tell it, when it boots up, to then go and deploy all these scripts, it works, but it&#8217;s not optimal. Whereas if we effectively have a template and we say, whenever that service is invoked, go and use this template, we get more control. Is that what you&#8217;re saying?</p><p><strong>Chris</strong>: So with serverless, every time a new orchestration runs, it&#8217;s almost like you&#8217;re deploying a new empty container.</p><p>And with containers, with Docker, you are telling it within that Docker script, okay, these are all my requirements, and it provides this initialization script that runs through those requirements. With Databricks serverless clusters, it doesn&#8217;t accept that; there&#8217;s no ability to do that.</p><p>So when your orchestration notebook kicks off, if it&#8217;s not one of the pre-built libraries that Databricks has loaded into its serverless environment, you have to install it. 
You either have to do a pip install; with Databricks there&#8217;s this magic command, right, that&#8217;s a percent symbol, and you could basically run a command line command there and do a pip install, but then it becomes inconsistent across your environment, right?</p><p>&#8217;cause every engineer that goes through and creates a notebook might do it slightly differently. They might not know whether there&#8217;s a wheel or not. They might do a magic %pip install, or they might do a command line and say python -m pip, and do it that way. So to provide consistency every time that those notebooks kick off, we have this kind of bootstrap notebook that does that and provides that consistency and maintainability across our whole serverless environment. Does that make sense?</p><p><strong>Shane</strong>: Yeah. So if I think about this idea of a data platform that has a serverless component and a persistent component, it reminds me of one of the many iterations of Azure where we had Synapse serverless and Synapse, what was it, the dedicated pools, the old PDW stuff. Yeah. Yep. So that pattern of a serverless component and a persistent data component seems the same pattern in Azure with Synapse as it does with Databricks. But I think what you&#8217;re telling me is, if I was gonna deploy this pattern, it works for Databricks, &#8217;cause Databricks technically works slightly differently.</p><p><strong>Chris</strong>: Mm-hmm.</p><p><strong>Shane</strong>: But would I do the same thing in Azure Synapse, or does it actually allow me to solve that problem by just telling the serverless component to initiate these things?</p><p><strong>Chris</strong>: The nice thing about Synapse is that when you&#8217;re deploying a pool, you could give it an initialization kind of script, right? Of these are the Python libraries that I need loaded for this, whether it&#8217;s serverless or dedicated, whichever pool you decide. But I think the Spark pools there are all serverless. You could predefine it, but Databricks doesn&#8217;t allow you to do that. You have to either do it in every notebook that you create that&#8217;s gonna run, or you create this kind of bootstrap script, that notebook that does that, that you basically just depend on for all the other notebooks that you create for your other ETL jobs.</p><p><strong>Shane</strong>: And that&#8217;s key: the technical deployment of a pattern may change depending on which technology you use. And there&#8217;s always something slightly different, which is why they have a competitive advantage or not. So you need to be really clear about that. When you pick up this pattern and you&#8217;re putting it into a new place, you need to actually understand there may be something I need to technically tweak.</p><p>And also, over time things change. So for example, we used to use Cloud Functions for our transformation code. So we would use a deploy and destroy on our transformation code as well. So effectively it would read our context engine, it would work out what it needs to do. It would write the code effectively, or hydrate it, &#8217;cause it was standardized. It would deploy it to a Cloud Function, it would run, it would then finish, and then we&#8217;d go on to the next one. So we&#8217;d daisy chain these little globs of code, but the manifest of how they were built was dynamic back then.</p>
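<p>A hedged sketch of the bootstrap notebook Chris describes for Databricks serverless compute: for each required library, install from a pre-staged wheel if one exists, otherwise fall back to PyPI. The wheel location, library names and pinned versions are illustrative assumptions; a real implementation would point at the team&#8217;s own ADLS or S3 path.</p><pre><code class="language-python">
"""Bootstrap notebook sketch: consistent library installs on serverless compute."""
import os
import subprocess
import sys

# Illustrative list of libraries and pinned versions this project needs.
REQUIRED = {"great_expectations": "0.18.19", "openpyxl": "3.1.5"}

# Illustrative location where pre-built wheels are staged; on Databricks an
# ADLS/S3 location is typically surfaced under a /dbfs mount or a Volume.
WHEEL_DIR = "/dbfs/mnt/libs/wheels"


def pip_install(target: str) -> None:
    # Same command every notebook uses, so installs stay consistent.
    subprocess.run([sys.executable, "-m", "pip", "install", target], check=True)


def bootstrap() -> None:
    staged = os.listdir(WHEEL_DIR) if os.path.isdir(WHEEL_DIR) else []
    for library, version in REQUIRED.items():
        prefix = f"{library}-{version}"
        wheel = next(
            (os.path.join(WHEEL_DIR, f) for f in staged
             if f.startswith(prefix) and f.endswith(".whl")),
            None,
        )
        if wheel:
            pip_install(wheel)                    # fast, version-controlled static copy
        else:
            pip_install(f"{library}=={version}")  # fall back to PyPI


bootstrap()
</code></pre><p>Inside a notebook the quick-and-dirty equivalent is a %pip install, but routing every job through one bootstrap notebook keeps the behaviour identical across the environment, which is the point of the pattern.</p>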
<p>A Cloud Function would only run, I can&#8217;t remember, for a minute or three minutes. So you get these ones where, if you were waiting for it, it would just terminate.</p><p>We had to do a fire and forget pattern. Then Cloud Functions went to nine minutes, I think, and most of our small chunks of code would run within that, so then we had to re-engineer the code to just use it. And then we had to move from Cloud Functions to a whole lot of other new things, &#8217;cause one of the downsides of Google is they love to deprecate their products whenever they feel like it.</p><p>So the point there is maybe the core pattern itself isn&#8217;t changing, but the technical implementation of it needs to be iterated over time, because data platforms and technologies are moving at such a pace that you always have to go back and review your technical patterns, don&#8217;t you? To make sure they&#8217;re still the most efficient way of doing it.</p><p><strong>Chris</strong>: Absolutely. And from my perspective, I try to go through those patterns at least once every 12 months, if not every six months. Because to your point, things change so incredibly frequently, especially if you take Fabric, for example, right? Fabric 18 months ago is not what Fabric is today, by any stretch of the imagination.</p><p>I initially looked at Fabric and was like, this is a marketing tool to try to push people into some new environment and try to raise their cloud bill. In fact, that&#8217;s 100% what it felt like when our Microsoft rep came and talked to us about it, and I was like, how does this save us money? He said, operationally, we&#8217;ll save you money.</p><p>And I was like, but can you gimme a cost difference between what we&#8217;re paying now and what we&#8217;d pay on Fabric? And he&#8217;s like, you&#8217;re not gonna see a decrease there. You&#8217;ll see a decrease in man hours. And so I was like, okay.</p><p><strong>Shane</strong>: Yeah, I remember I got asked to come into an architecture review for an organization that was deploying Databricks, and somebody obviously came into the organization and said, it&#8217;s going Microsoft Fabric. It was quite a few years ago. And so you engaged with the Microsoft team, actually, &#8217;cause here in New Zealand they&#8217;re a great team, they&#8217;re really helpful, and said, yep, show it to me. And we had nothing but PowerPoint, and you sit there going, &#8217;cause I used to work for Oracle many years ago, so I&#8217;m like, I did a lot of PowerPoint demos back there where I&#8217;d been told that the product was real. But I suppose my view back then was this is gonna go either one of two ways. This is just gonna be marketing architecture, right? They&#8217;re not actually gonna do any changes under the covers, we&#8217;re just gonna rebrand and re-cobble together all the quite disparate technologies that had happened by that time.</p><p>Or they&#8217;ll Power BI it. Mm-hmm. And they will actually re-engineer it over time, and they will make it cheap as chips, and they will completely dominate the market, because if you&#8217;re a Microsoft customer, why wouldn&#8217;t you use Fabric? So I think we must have looked at it back then at the same time, when it was just PowerPoint slides.</p><p>But you&#8217;re saying now it has changed. 
You&#8217;re saying that actually there&#8217;s some meat and potatoes under the PowerPoint now?</p><p><strong>Chris</strong>: I think it, especially now. I will say that Microsoft right now is definitely coercing people towards Fabric in many different ways, but it&#8217;s also grown a lot. Yes, absolutely. At one point, what was the phrase that you used? Something about marketing.</p><p><strong>Shane</strong>: Yeah.</p><p><strong>Chris</strong>: Architecture. Yes. I love that. Okay, I&#8217;m gonna steal that. At first it felt like marketing architecture, right? It 100% was, let&#8217;s take ADF, let&#8217;s take Power BI, let&#8217;s take this OneLake that we just started implementing with Power BI, and let&#8217;s put it all within this one basic GUI, right?</p><p>And that&#8217;s really all it was. But now they allow you to spin up a SQL database, and that&#8217;s relatively new in the last three or four months. They have this new version of ADLS that is OneLake, and that kind of helps the interactivity between all of it, because it sits physically closer on their servers, so it runs a little bit faster.</p><p>So yes, from an orchestration and from an operational standpoint, it&#8217;s a little bit more efficient, and the cost structure has significantly improved. When it first came out, the cost structure was, frankly, in my opinion, a little bit ridiculous. The minimum was 10 grand a month or something like that, and it was nuts.</p><p>But now you could spin up an F2 and it&#8217;s much more reasonable in cost than it was 18 months, two years ago.</p><p><strong>Shane</strong>: And one of the interesting things about that is if we think about technical patterns, and actually Roelant Vos and Dirk Lerner have got a book coming out this month, and they talk about design patterns, which are the really big Lego blocks, the core meta patterns to a degree, and then solution patterns, how you can actually use them.</p><p>And then they&#8217;ve actually got technical patterns where they push code to Git to show you how it works. We can look at vendors and we can see patterns. So I&#8217;ll give you three examples. So Microsoft, what I saw them do with Fabric was a couple of things. They started focusing on the control planes. As you said, they focused on how do we have one UI that can actually orchestrate, or help us build and orchestrate, everything?</p><p>How do we have one lake? How do we have one place where there is effectively a catalog, a unity catalog that has a control plane where all pieces of data go through, even just from a context or metadata point of view? And that makes sense, right? That&#8217;s like getting the foundationals in, because really, if we got to the stage where I could use that UI or those control planes, and it goes and determines which of the many technical components it&#8217;s spinning up, what&#8217;s the best, most cost effective and performant thing for the job that&#8217;s about to be done, then I don&#8217;t have to care.</p><p>That&#8217;s a great pattern. And then the second thing that you talked about is, because they were early, they need some early adopters, and they need people they can get good feedback from, and they don&#8217;t wanna open up to the entire market because the product&#8217;s not ready yet. So what they do is they put a price on it that only certain large customers can afford, to stop everybody else using it, because they know it&#8217;s not quite ready.</p><p>And I&#8217;m not taking a pot shot at Microsoft. 
That&#8217;s a great pattern to deploy and pivot your product. But here&#8217;s another pattern that we&#8217;ll see in the next week. Databricks and Snowflake have gotten into a habit of doing what used to happen when I was at Oracle. They will announce features that don&#8217;t exist now.</p><p>Mm-hmm. Yeah. At both their summits, they will announce features that may turn up in 12 months, and then they have a pattern. They all use different words, right, but there&#8217;s the internal one, and then there&#8217;s a small bunch of really trusted users, and then there&#8217;s the slightly broader one, but you still have to be invited.</p><p>And then there&#8217;s the early adopters: it&#8217;s open, but don&#8217;t trust it. And then there&#8217;s the, this thing&#8217;s scalable, it&#8217;s been tested by their customers a lot. And that&#8217;s normally a 12 month cycle. So when we see these lovely architecture announcements right now, while I get grumpy with them, I look for the patterns of, okay, where in their deployment cycle is that piece of code?</p><p>How does it fit with their strategy? And then the last one, &#8217;cause it&#8217;s topical right now, is Fusion from DBT.</p><p><strong>Chris</strong>: Mm-hmm.</p><p><strong>Shane</strong>: Yeah. Oh, I remember when I was, again, at Oracle, I&#8217;m sure we had a product called Fusion, and then it got called Confusion, which I see a lot of DBT competitors now starting to use. But you think about their problem, right?</p><p>They have an open source product that they&#8217;re not making money on, that everybody uses for free, and that everybody expects them to keep developing and giving them features for free. They are now a commercial company that has VC funding and actually has to get growth or profitability, right? That&#8217;s the two things they have to do.</p><p>Yeah. I don&#8217;t see why the market&#8217;s surprised that they are now merging those two things and making them more and more paid. That is a common pattern. Yeah. I mean, it could be worse. It could be Elasticsearch, right? It could have been AWS picking it up, calling it an AWS product and not actually letting DBT get any of the revenue.</p><p>I&#8217;m not sure that&#8217;s a better pattern.</p><p><strong>Chris</strong>: It might be Salesforce&#8217;s next acquisition.</p><p><strong>Shane</strong>: What? DBT? Ah, no. They need a database, right?</p><p><strong>Chris</strong>: They do.</p><p><strong>Shane</strong>: I was thinking about it the other day. Teradata. Salesforce has to buy Teradata.</p><p><strong>Chris</strong>: Yeah. Because that&#8217;s the next tool that we don&#8217;t really love using, right? They went through and Salesforce bought Tableau, which is the last generation of BI tools, like everybody uses Looker or Power BI or all these other visualization tools, and they&#8217;re like, let&#8217;s buy Tableau for a huge amount of money.</p><p>And then they&#8217;re like, okay, now let&#8217;s buy Slack for a blue billion dollars. They bought Slack, they bought Tableau, and now they&#8217;re gonna buy Informatica. And these are the stepchildren of the tools that we love using. And so maybe, like you said, Teradata feels like the database of choice for our stepchildren of databases that people, well,</p><p><strong>Shane</strong>: especially &#8217;cause Teradata actually re-platformed their on-prem database to be cloudy and they brought their price point down.</p><p>They just missed the market by a couple of years. 
So if we look at patterns, let&#8217;s look at the Lego blocks of an end-to-end platform: communication, Slack; visualization, Tableau; they&#8217;ve got their AI thing, right? They&#8217;ve now got the transformation and the data acquisition. They&#8217;ve got MuleSoft, they&#8217;ve got the streaming version of that.</p><p>They just now need storage, and then they become end to end. Yeah. It&#8217;s a pattern.</p><p><strong>Chris</strong>: They become the Oracle of today, right?</p><p><strong>Shane</strong>: Yeah. Although I do remember my theory back then, when Oracle was buying everybody, was that they bought their competitors because they couldn&#8217;t beat them. The reason I moved into data was I started out in ERP and financials, and we used to get our asses kicked by PeopleSoft all the time.</p><p><strong>Chris</strong>: Oh yeah.</p><p><strong>Shane</strong>: And yeah, like just every deal, it was like our product was ugly and their product was beautiful. It was as easy as that. And then my view was Oracle bought PeopleSoft to kill it. It was, we can&#8217;t beat you, so we&#8217;ll buy you. And that was actually a, uh,</p><p><strong>Chris</strong>: hostile takeover.</p><p><strong>Shane</strong>: Hostile takeover, yeah.</p><p>Poison pills and cool names like that. Anyway, let&#8217;s go back to technical patterns. Oh, before we do that, actually, what the hell is a wheel?</p><p><strong>Chris</strong>: A wheel? Okay, that&#8217;s a good point. So there are a couple of different ways that you could install libraries, Python libraries. One is you could go out and basically you&#8217;re gonna phone a friend, right?</p><p>You&#8217;re gonna call PyPI, which is this big repository of all these Python libraries, and say, hey, this is the library I want, I&#8217;m gonna download it and I&#8217;m gonna install it. Or you could create a wheel. And a wheel is basically a local file of the library that you could save locally on your computer, or you could save it in an S3 bucket or an ADLS container. And then that way you have a static install of that library. That&#8217;s probably the easiest way to explain it.</p><p><strong>Shane</strong>: Okay. So it&#8217;s like, in my head, what we call a manifest. We effectively go and grab all the transformation code we need for a job.</p><p>We then basically build out that Python or SQL series of steps. We then keep it as a manifest, and then we run that manifest and then we destroy it. But what you&#8217;re saying is it&#8217;s basically a manifest, a list of instructions that are stored somewhere so that you can make it reusable. So is wheel a Databricks term?</p><p><strong>Chris</strong>: Yeah, no, a wheel is just a Python thing. It&#8217;s like an executable for Python libraries, right? It holds all of the instructions for that specific Python library that you&#8217;re wanting to install. And the advantage of having a wheel that you use to load every time is that, one, you&#8217;re not having to call and download a file from PyPI, right? So it&#8217;s a little bit quicker, more efficient, but also you have a consistent version of that library.</p>
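<p>A short sketch of how a wheel gets produced and staged in the first place, to make the idea above concrete. It assumes pip and a storage-backed directory; the package name, pinned version and paths are illustrative, not from the episode.</p><pre><code class="language-python">
"""Build a wheel once, stage it, and reuse the static copy (illustrative sketch)."""
import subprocess
import sys

PACKAGE = "great_expectations==0.18.19"  # pin the version you have regression tested
WHEEL_DIR = "/dbfs/mnt/libs/wheels"       # assumed ADLS/S3-backed staging location

# 1. Build (or download) the wheel for the pinned version into the staging area.
subprocess.run(
    [sys.executable, "-m", "pip", "wheel", PACKAGE, "--wheel-dir", WHEEL_DIR],
    check=True,
)

# 2. Later, every job installs from the static copy instead of calling PyPI,
#    so the version only changes when the team deliberately rebuilds the wheel.
subprocess.run(
    [sys.executable, "-m", "pip", "install", "--no-index",
     "--find-links", WHEEL_DIR, PACKAGE],
    check=True,
)
</code></pre>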
<p>If you call PyPI, it&#8217;s gonna give you the latest one every time, which can cause issues in your ETL if there was some major change in the updated version, right?</p><p>And so you download and use these wheels to do your installs, and that way you wait until you&#8217;re ready to install the latest version and you&#8217;ve done any type of regression testing and made sure that everything&#8217;s gonna run correctly on the new version. Then you could download the new wheel, save your wheel to your blob container, wherever it may be, and then reuse that.</p><p><strong>Shane</strong>: And that, again, is a pattern you&#8217;ve just described, a small pattern that solves a problem within a context. And one of the things about patterns, whether we&#8217;re talking design, solution or technical patterns, is that the language changes and it&#8217;s all about context. So the example I use regularly from that book about buildings is, if you have a lounge, a sitting room, and you&#8217;re in a sunny place, you&#8217;re typically gonna have big windows.</p><p>If you have a bathroom, you are typically gonna have a small window, and it probably is gonna be opaque. I&#8217;ve got a friend that lives in Iceland. Their windows are very thick because of the cold, so the design of their houses is very different. Got another friend that lives in one of the most beautiful spots in the world, on effectively a hilltop, and nobody overlooks them.</p><p>So their bathroom actually has a shit ton of windows and they&#8217;re not opaque, because they want light in their bathroom and nobody is around to actually see in, in theory. So if you think about it again, that context is key. And that example you gave of a wheel: when does versioning and making a static copy of those libraries have value, versus when do you actually wanna reach out and grab the latest version? It&#8217;s just that the context and the choices in every pattern are important, and language is important. Excellent. You&#8217;ve got another one?</p><p><strong>Chris</strong>: I do, I have a third one, and this one is more Azure based, so it&#8217;s more that kind of low code, no code, &#8217;cause we&#8217;re gonna use ADF. In Azure, one of the patterns that I use for orchestrating, in this case, pipelines in Azure Data Factory and even Databricks notebooks, is that you can actually orchestrate all of that with an ADF pipeline, which is really cool, which is nice.</p><p>You don&#8217;t have to go out and do an Airflow thing, so if you&#8217;re in an environment where you don&#8217;t have access to Airflow, then ADF provides you the availability of this pattern. And so basically you&#8217;re creating a pipeline where you are dropping these tasks into your pipeline, like Run Notebook. You could run, like I said, Databricks notebooks, so you&#8217;d run a Databricks notebook, and then you could drop another task in that&#8217;s Run Pipeline, and you define which pipeline it&#8217;s running.</p><p>If it succeeds, then it moves on to the next pipeline that it needs to run within that orchestration. And then if it fails, and this is one of those patterns within a pattern, if it fails, then you&#8217;re gonna go through this alerting process. 
You&#8217;re gonna send an alert.</p><p>Generally, I like to use Teams for people that are in that Microsoft environment, because a Teams message people get on their phone and when they&#8217;re in front of their desks. Email they may or may not get, but with a Teams message, whoever&#8217;s on call can make sure that they&#8217;ve got those alerts on, and they are getting those alerts that say, hey, this specific pipeline failed within the orchestration, we need you to go and check it right now. And super high level, that&#8217;s what that pattern looks like. Then at the end of it, after each pipeline, it actually goes through and sends the metadata for that pipeline to a table, to say, okay, this pipeline completed successfully, or didn&#8217;t, if it did fail.</p><p>And then we have the metadata tracking as well that goes along with it. So there&#8217;s a lot of configuration there, and I&#8217;ve probably just named three or four different patterns within a pattern, so I apologize.</p><p><strong>Shane</strong>: Oh, you definitely have. And then I&#8217;m assuming that the log from every run goes to some kind of file based or log based storage.</p><p>So you can always go back and forensically see what SQL was run or what code was run, and then you probably aren&#8217;t versioning the data, you aren&#8217;t keeping copies of before and after data. Yes, correct. And yeah, so that, again, that orchestration pattern, I can look at that and go, yeah, directed graph. That is the orchestration flow; alerting, pushing that out to different channels, effectively depending on the urgency of the problem.</p><p>Metadata or storage of all the run stats, &#8217;cause you&#8217;re gonna need those; I&#8217;m assuming you&#8217;re gonna have runtime, number of records loaded, you&#8217;re gonna have all the stuff that you&#8217;re gonna use later. Pushing the actual logs from the run somewhere else, that you&#8217;re just gonna keep as cold storage in case you have to go in and forensically figure out what the hell happened.</p><p>You&#8217;re probably gonna generate a run ID, so you can see that this set of code ran at this particular time, and then you&#8217;re indexing it. If I think about what we do, we do a lot of those patterns, but we do one thing really differently. We don&#8217;t use a directed graph. So we don&#8217;t go in and actually hard code the flow.</p><p>What we do is we dynamically generate it. And that&#8217;s a decision we made right at the beginning. Now, that is a completely different pattern, but everything else: we push to Slack, we don&#8217;t push to Teams. We log everything like you do. We have a metadata or context repository where we can see every run and every ID and every table that was loaded.</p><p>We get performance stats, so we go, that one&#8217;s taking too long or costing too much. All those patterns for orchestration. Out of interest, going back to technology again, ADF, Azure Data Factory, has been around for a long time. It&#8217;s a fairly mature product, and it&#8217;s a mature product that&#8217;s survived many actual architecture changes within Microsoft.</p><p>So it is a core component. I would have a bet that it&#8217;s one of the components that won&#8217;t disappear, and yet I see lots of people doing Azure deployments using Airflow.</p>
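<p>To make the moving parts of that orchestration pattern concrete, here is the same idea expressed as a plain Python sketch rather than an actual ADF pipeline: run each step, post a Teams message on failure, and record run metadata. The webhook URL, step names and the in-memory log are all assumptions for illustration; in ADF the equivalent pieces are pipeline activities, an alert action and a metadata table.</p><pre><code class="language-python">
"""Orchestration sketch: run steps, alert on failure, log run metadata (illustrative)."""
import json
import time
import urllib.request
import uuid

TEAMS_WEBHOOK = "https://example.webhook.office.com/..."  # assumed incoming webhook URL
RUN_LOG = []                                              # stand-in for a metadata table


def notify(message: str) -> None:
    # Teams incoming webhooks accept a simple JSON payload.
    body = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(TEAMS_WEBHOOK, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)


def run_pipeline(run_id: str, name: str, step) -> None:
    started = time.time()
    try:
        rows = step()  # each step returns the number of rows it loaded
        RUN_LOG.append({"run_id": run_id, "pipeline": name, "status": "succeeded",
                        "rows": rows, "seconds": round(time.time() - started, 1)})
    except Exception as exc:
        RUN_LOG.append({"run_id": run_id, "pipeline": name, "status": "failed",
                        "rows": 0, "seconds": round(time.time() - started, 1)})
        notify(f"Pipeline {name} failed in run {run_id}: {exc}")
        raise  # stop the chain, as a failed ADF activity would


def load_customers() -> int:
    return 1250   # placeholder step


def load_orders() -> int:
    return 98000  # placeholder step


if __name__ == "__main__":
    run_id = str(uuid.uuid4())
    for name, step in [("load_customers", load_customers), ("load_orders", load_orders)]:
        run_pipeline(run_id, name, step)
    print(json.dumps(RUN_LOG, indent=2))  # in the real pattern this lands in a metadata table
</code></pre>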
<p>Now, in the old days, ADF, I think, used to be a full fat kind of client, but now it&#8217;s serverless, isn&#8217;t it? It&#8217;s more of a serverless type of behavior, so we get the benefits of that. So why would somebody deploy a technical Airflow pattern for their orchestration in Azure, compared to just using ADF?</p><p><strong>Chris</strong>: There&#8217;s a couple of reasons. One, I think that people don&#8217;t see ADF as an orchestration tool. I think that Airflow is the big name out there. I think a CDO or CTO gets a phone call and it&#8217;s, hey, let&#8217;s deploy Airflow.</p><p>And they&#8217;re like, oh yeah, that&#8217;s the thing, we&#8217;re gonna use Airflow, without realizing that. I think so many people just schedule and forget ADF pipelines and don&#8217;t realize that they could actually do a lot of the orchestration within ADF, for one. And two, there&#8217;s definitely some additional complexity that can be involved that Airflow lends itself to.</p><p>I&#8217;ve run into companies where they don&#8217;t do everything within their Azure environment, right? Where they have different pieces that are running different code in different places, and Airflow does a great job of bringing that all together and allows you to orchestrate it from one place, where with ADF, maybe you&#8217;re calling an API to get something to trigger somewhere.</p><p>Or maybe you are having to creatively figure out a way to kick something off in some other system somewhere that doesn&#8217;t have a native connector within ADF. So that conversation that we had earlier about native connectors, and whether you go custom or you use the native connector, right; Airflow has a lot of those connectivity pieces built in already.</p><p>The other thing is that ADF has come a long way, I think, especially in the last couple of years. Partially, I think, because Microsoft knew that they were gonna push it into Fabric. They needed to make it a little bit more robust. They needed to get it to a place where they felt like, okay, this is a great tool for Fabric.</p><p>And to your point, it&#8217;s not gonna die anytime soon, just like SSIS packages are not gonna die anytime soon, because I&#8217;m shocked that they haven&#8217;t, but they haven&#8217;t. There&#8217;s that piece. I also think I&#8217;ve run into a lot of companies that picked up Airflow because they&#8217;re like, we&#8217;ve got Databricks, we&#8217;ve got Azure, we&#8217;ve actually got this small AWS account over here in this corner somewhere, and we needed to orchestrate all of it.</p><p>And that&#8217;s easier done with Airflow than it is with ADF.</p><p><strong>Shane</strong>: Does Databricks have an orchestrator? A native one?</p><p><strong>Chris</strong>: Databricks has, I will say, in one of the companies I&#8217;m doing consulting with, we&#8217;ve written a custom notebook that does the orchestration. And so it&#8217;s not a native orchestrator like you would see in Airflow or even Fivetran or ADF.</p><p>It&#8217;s all written with Python, and we&#8217;re kicking off child notebooks within what we&#8217;re calling the orchestrator notebook. And then we&#8217;re running those either in sequence or in parallel, because we have that great kind of concurrency and parallelism that we could use within Spark.</p><p>But it&#8217;s very custom and very hand-built, right? 
<p><strong>Shane</strong>: Does Databricks have an orchestrator? A native one?</p><p><strong>Chris</strong>: Databricks has&#8230; I will say that in one of the companies I&#8217;m doing consulting with, we&#8217;ve written a custom notebook that does the orchestration. So it&#8217;s not a native orchestrator like you would see in Airflow, or even Fivetran or ADF. It&#8217;s all written in Python, and we&#8217;re kicking off child notebooks from the orchestrator notebook, as we&#8217;re calling it. And then we&#8217;re running those either in sequence or in parallel, because we have that great concurrency and parallelism that we can use within Spark. But it&#8217;s very custom and very hand-built, right? It&#8217;s a custom workaround.</p><p><strong>Shane</strong>: Alright, and then with that master notebook, the controller of the controllers, are you hard coding the things it&#8217;s calling in that notebook, or is it dynamically looking it up and generating that manifest from somewhere else?</p><p><strong>Chris</strong>: So it&#8217;s dynamically looking it up based on a couple of things. We have a table that says, okay, these are the notebooks that need to be run. And that way it&#8217;s not a full SDLC, right? It&#8217;s not a full lifecycle every time something has to change; we&#8217;re just adding it to this table that&#8217;s sitting in a Synapse SQL database. Then it can dynamically just change. If we have a new notebook, it&#8217;s a little bit less overhead and we&#8217;re just dumping it in there.</p><p><strong>Shane</strong>: Yeah. The reason I ask is there&#8217;s this mega pattern. I haven&#8217;t got a name for it. I&#8217;ve gone through many names and I&#8217;ve hated every one of them; none of them stuck. So I used to talk about config driven. We had a config table, and everything was driven off that config. So that table you hold of all the things that should run, you&#8217;re gonna have some tagging in there, you&#8217;re gonna have a whole lot of attributes, context, and your mega script, your mega notebook, is gonna go through and say, hey, I&#8217;ve been told to run this, what are the dependencies? Effectively, it&#8217;s gonna look up that config table. That name didn&#8217;t stick. And then I looked at active metadata; that was a term in the market and that didn&#8217;t stick. Then I tried semantic engine and that didn&#8217;t stick. I just saw VaultSpeed, &#8217;cause they effectively use context-driven ETL, and they called it model driven or something, and I was like, yeah, I hate that. Actually, Juan Sequeda posted something around semantic layers and ontologies, as he does, and then I had a comment from Chris Tabb as well. And right now I&#8217;m stuck on this idea of a context layer, &#8217;cause effectively that&#8217;s what you&#8217;re doing. You&#8217;ve created this SQL table of context.</p><p>And we have context for orchestration, we have context for ETL, we have context for the business glossary. It&#8217;s all context we use. So I&#8217;m playing around with that, right? Context layer is this mega pattern. That&#8217;s why I was asking, &#8217;cause I could just see this idea of store the context, then use the context to dynamically generate the thing that needs to run, rather than hard coding it into that notebook.</p><p><strong>Chris</strong>: And I&#8217;m a big fan of that specific pattern where you are using table-driven architecture to do things, for a lot of reasons. I think it&#8217;s a great pattern for cutting down on new projects that are initiated just because one small piece has changed. I&#8217;m a big fan of doing that with business logic, having business logic that is table driven, because business logic changes so incredibly frequently. You have a table, this is where you&#8217;re putting your calculations, where you&#8217;re putting all your relationships and everything, and you&#8217;re driving all that business logic from this table-driven architecture as opposed to hard coding. Hard coding makes me wanna pick up my trash can and then puke in it.</p><p><strong>Shane</strong>: Yeah. And again, that pattern of hard coding is hidden. Mm-hmm.</p>
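<p>A minimal Python sketch of the table-driven business logic Chris describes: the rules live as rows of data, so changing a calculation means editing a row rather than redeploying code. The rule names, columns and expressions below are invented for illustration.</p><pre><code># Minimal sketch of table-driven business logic: rules are rows of data.
# The rule table, column names and expressions are illustrative only.

RULES = [
    # target column, expression evaluated against one input record
    {"target": "order_value", "expression": "quantity * unit_price"},
    {"target": "high_value", "expression": "quantity * unit_price > 1000"},
]

def apply_rules(record, rules):
    """Evaluate each rule expression against one record and add the results."""
    enriched = dict(record)
    for rule in rules:
        # eval() keeps the sketch short; a real system would use a safe
        # expression engine or generate SQL from these rows instead.
        enriched[rule["target"]] = eval(rule["expression"], {}, dict(enriched))
    return enriched

if __name__ == "__main__":
    row = {"order_id": 42, "quantity": 12, "unit_price": 95.0}
    print(apply_rules(row, RULES))  # adds order_value and high_value
</code></pre>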
<p>So I remember back when we were in Oracle Warehouse Builder, where we used to hard code the graph, the DAG; we would say this node goes into this node. Alteryx came out and we were hard coding things. Coalesce comes out and I&#8217;m looking at it going, we&#8217;re hard coding it. It may be pretty and it may look like you&#8217;re fast, but you are hard coding that thing and it&#8217;s not dynamic. But then I remember when we went from Oracle Warehouse Builder to Oracle Data Integrator, when they bought, oh, that French company, and that was object oriented. You just created these one-off objects and then it built everything dynamically, and nobody understood it because it was just too hard.</p><p>I&#8217;m assuming the two people that wrote the product got it. But what I saw was every consultant we had would just basically hard code ODI packages to make them look like Warehouse Builder, that data flow style. So I think this idea of context driven, we have context about business logic, we have context about orchestration. If we store that as physical data and then use that to generate or hydrate what we need, it&#8217;s a much better pattern.</p><p><strong>Chris</strong>: It&#8217;s so much more maintainable than hard coding it, especially when it comes to something that you inherited, that somebody else inherited, that somebody else inherited. Reading through all that hard-coded logic becomes unbearable sometimes.</p><p><strong>Shane</strong>: Yeah, and then actually we still need a map. So while I&#8217;m bagging on data flows, and these directed graphs and nodes and links are not the ideal way of configuring what we do, surfacing it as a picture really helps: giving a person a map where they can say, show me the nodes and links, the steps. That orchestration, or those steps in the ETL, that has value, but it gets hydrated from the core context. It&#8217;s not the actual context itself.</p><p>And so if we think about that, you talked about your 25 years in the data space; there&#8217;s a bunch of people I can talk to that talk patterns, whether they know it or not. And when you talk about what a pattern is, think of them as Lego blocks, that&#8217;s how they think. And then if you think about what we&#8217;ve done today, we&#8217;ve gone from some really high-level ones down to infinite detail. We&#8217;ve said, hey, if you&#8217;ve got orchestration, now you&#8217;ve gotta worry about the pattern of where you store your notifications, which channels you&#8217;re pushing to, what&#8217;s orchestrating, how you build your context engine so that it&#8217;s dynamic orchestration, not hard coded.</p><p>Ah, now we need to log every run. How do we parse those? How do we get alerts? How do we get hints on what failed? How do we schedule it? How do we daisy-chain it up? There&#8217;s all these patterns on patterns. And if you come into the data world and you haven&#8217;t done all those, it must be so overwhelming, because where do you start? Oh, I&#8217;m gonna choose between Azure Data Factory and Airflow. So you come in there, and every decision has an impact, &#8217;cause if I pick Azure Data Factory over Airflow, I&#8217;ve now already made some trade-offs, because Azure Data Factory means I can&#8217;t, in theory, move my platform to AWS or Google Cloud, which I, in theory, can do with Airflow. I still call bullshit on that, because the cost of change is so high. But in theory I can. Or, as you said, if I need to punch out to a whole lot of systems outside of Azure, you know that you&#8217;re pretty much gonna go Airflow. But you have to know that&#8217;s the lever, that&#8217;s the trade-off decision.</p>
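<p>And a minimal sketch of the &#8220;context engine&#8221; side of that: the orchestrator reads what to run from a context or config table instead of a hard-coded flow. The table is mocked in memory here and the notebook runner is a stub; in a Databricks environment it would typically wrap dbutils.notebook.run.</p><pre><code># Minimal sketch of a metadata-driven orchestrator: the "what to run" lives in
# a config/context table, not in hard-coded notebook calls.
# The rows are mocked in memory; in practice they would come from a database.
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a row-per-notebook config table (e.g. in a Synapse SQL database).
NOTEBOOK_CONFIG = [
    {"path": "/pipelines/load_customers", "group": 1, "enabled": True},
    {"path": "/pipelines/load_orders", "group": 1, "enabled": True},
    {"path": "/pipelines/build_marts", "group": 2, "enabled": True},
]

def run_notebook(path):
    """Placeholder child-notebook runner; replace with the platform's API."""
    print(f"running {path}")
    return "ok"

def orchestrate(config):
    """Run enabled notebooks group by group; a group's notebooks run in parallel."""
    groups = sorted({row["group"] for row in config if row["enabled"]})
    for group in groups:
        paths = [r["path"] for r in config if r["enabled"] and r["group"] == group]
        with ThreadPoolExecutor(max_workers=len(paths)) as pool:
            results = list(pool.map(run_notebook, paths))
        print(f"group {group} finished: {results}")

if __name__ == "__main__":
    orchestrate(NOTEBOOK_CONFIG)
</code></pre>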
<p>And so we don&#8217;t document these patterns, we don&#8217;t share these patterns. We share reference architectures for a vendor, but we don&#8217;t share them for patterns. So it must be hard if you come in without all the experience to actually work all this out.</p><p><strong>Chris</strong>: Yeah. And I think for new people it&#8217;s already overwhelming because of all those different tools that are out there, right? Especially from a data engineering standpoint, the vast number of different tools and platforms and languages. Maybe 80% of the market out there is using Python, but there&#8217;s another 20% out there that&#8217;s using Java from a data engineering standpoint. And let me tell you, that is not an easy switch to go from Python to Java. Both are OOP, which, for those of you that may not know, is object-oriented programming, but with Java you need a higher kind of technical aptitude than you need for Python. SQL to Python to Java is probably a good building block, but you can&#8217;t jump straight into Java, unless maybe you didn&#8217;t know anything to begin with. And there&#8217;s so many tools, and it&#8217;s one of those things where you don&#8217;t know what you don&#8217;t know; there&#8217;s so much out there. At AT&amp;T I did have a kind of repository of patterns that we would use for different situations, and a lot of that was driven by what was whitelisted and what wasn&#8217;t, so what was allowed by the company and what wasn&#8217;t. And I think even then, because there were like 80 different patterns in there, even that could become overwhelming for a junior or a fresher that&#8217;s stepping into that role. They open up that repository and they&#8217;re like, oh my gosh, I need to learn all this like yesterday.</p><p><strong>Shane</strong>: And I&#8217;ve seen the consequences. I remember I came in to rescue a project in New Zealand, and they&#8217;d gone MapR. It was back in the Hadoop days, where we had a choice of native Hadoop, Cloudera, and MapR. There was one more, wasn&#8217;t there? Oh, I can&#8217;t remember. And at the time, the organization wanted to bypass all their procurement process. So they got one of the shiny-suit large consulting companies to come in and do a quick market scan for them, which is how they used to bypass the procurement process in government back then. And funny enough, that consulting company picked MapR, and they happened to be the only implementation partner for it in New Zealand. Fancy that. And when we came in, everything was being hard coded, which is a problem. We wanted to move it to this context, config, table-driven pattern. So we needed to do some development, and we&#8217;re like, okay, what&#8217;s the choice? And the choices back then were Python or Scala.</p><p><strong>Chris</strong>: Mm-hmm.</p><p><strong>Shane</strong>: And it&#8217;s okay, how do we make a decision? And we said the MapR team are all Scala experts, yeah, so that&#8217;s what we should go with, &#8217;cause we really didn&#8217;t care at the time. And the unintended consequences of that were massive, because there were no Scala skills in New Zealand. Mm. Everybody was either doing R or Python.</p>
<p>And what that meant was we could only use the MapR team. And actually the MapR team didn&#8217;t quite have as many Scala skills as they said they had, so they had to pass it back to the US, and we ended up with an offshore team. Like, we always knew it was gonna be offshore, because we knew the MapR team was supposedly domiciled in Australia, but their core skills were actually in the US, and we had this massive time zone difference. And that one decision, Scala over Python, without bringing in the context of, can we buy skills for that in New Zealand if we had to, had so many unintended consequences that caused major problems with that project.</p><p>So again, it&#8217;s hard. I think actually in the new gen AI, LLM world, we have the ability now to potentially document these patterns. And especially, you talked ages ago about this idea of step by step with screenshots. If we do light context of what the pattern is in some kind of structured format, some kind of templated way, and then we do the step by step and we put that into an LLM as context, I think we have the ability to let people who aren&#8217;t as experienced ask a question and get something back that may help them get through that whole noise, get the signal from the noise. &#8217;Cause like you said, as soon as we take a mega pattern and we break it down, and then we bring in all the technical implications of every different flavor of technology, we can do that. The challenge is nobody&#8217;s gonna write them; who&#8217;s gonna write that repository for free?</p><p><strong>Chris</strong>: Yeah. No way, because it&#8217;s a massive undertaking. To your point, I think that&#8217;s what people try to do with Terraform too, right? They try to productionalize all these little different pieces. At AT&amp;T we had a system where we checked a bunch of boxes for our project and then hit the button at the end of it, and it would go through and grab all the Terraform scripts that it needed in order to create and spin up all the resources that we needed for that kind of larger project we were working on. But it&#8217;s a similar kind of idea, right? We have an LLM that&#8217;s been trained on all these different patterns that we have out there. We allow people, especially freshers that are learning, right, to go through and really just talk to it. If you&#8217;re using something like ChatGPT, you could literally just explain your project verbally to ChatGPT or whatever, and then it spits out, okay, these are the patterns, and gives you the PowerPoint presentations or whatever the case may be, and shows you, okay, this is what I&#8217;m gonna be doing. And then it actually even just does it, right? Give it some configuration pieces and allow it to write a lot of it for you.</p>
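<p>As a rough sketch of what &#8220;light, structured context for a pattern&#8221; could look like before being handed to an LLM, here is a small Python example. The fields and the example pattern are invented for illustration, not a published template.</p><pre><code># Minimal sketch: describe a pattern in a light, structured way, then flatten
# it into text that can be passed to an LLM as context alongside a question.
# The fields and example content are illustrative, not a standard template.
import json

pattern = {
    "name": "Dynamic orchestration from a context table",
    "problem": "Hard-coded pipeline flows need refactoring whenever a new table or notebook is added.",
    "solution": "Store what should run, its grouping and dependencies as rows in a context table, and have the orchestrator generate the run order from it.",
    "trade_offs": [
        "The flow is no longer a hand-drawn diagram; a map must be generated from the context.",
        "Requires discipline around maintaining the context table.",
    ],
    "steps": [
        "Create the context table (one row per pipeline or notebook).",
        "Write an orchestrator that reads the table and builds the execution order.",
        "Log each run's metadata back to a run-history table.",
    ],
}

def as_llm_context(p):
    """Flatten the structured pattern into plain text for a prompt."""
    return json.dumps(p, indent=2)

if __name__ == "__main__":
    question = "How should a new team avoid hard-coding their orchestration?"
    prompt = (
        "Use the following pattern as context.\n\n"
        + as_llm_context(pattern)
        + "\n\nQuestion: " + question
    )
    print(prompt)
</code></pre>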
<p><strong>Shane</strong>: And then you have to go&#8230;</p><p><strong>Chris</strong>: &#8230;go through and comb through it. But yeah.</p><p><strong>Shane</strong>: Yeah, and then you have to go, what&#8217;s the consequences of it getting it wrong? And do we get it right ourselves? As I said, I&#8217;m not technical; Nigel does all the engineering stuff, but I have a bunch of patterns for working with teams. I can drop into a team and observe for a while and go, okay, that&#8217;s probably where their biggest gap is right now, and then I&#8217;ll bring some ways of working patterns. But even then, I forget half the patterns I&#8217;ve done before. I&#8217;m like, oh, I&#8217;m sure I had something that kind of helped the team solve that problem, and then I&#8217;ve gotta go search my repository. Do you find that, with the AT&amp;T stuff, the patterns that you had, I&#8217;m assuming they&#8217;re now either in your head or you&#8217;ve documented them lightly. So when you go and help a customer, you&#8217;re like, ah, okay, there&#8217;s a problem, I have a pattern for that, I&#8217;m gonna grab it, it&#8217;s 80% of what I need, and then I&#8217;ll tweak it for their context. Yes?</p><p><strong>Chris</strong>: And that&#8217;s exactly it. When I left AT&amp;T I tried to go through and write down, okay, these are all my patterns that I use pretty frequently, and then you tweak &#8217;em, because otherwise it becomes this tribal knowledge. And you get that with a lot of companies too, where you have these patterns that have been used for decades, and there&#8217;s that one person that knows what that pattern looks like, and then they get hit by a bus tomorrow and that pattern&#8217;s gone forever. So it&#8217;s important, I think, to document these things, to actually physically go through and take the time to write them down. You don&#8217;t want them just stuck in your head, or even in a process somewhere. You wanna document those things.</p><p><strong>Shane</strong>: But do you find that even though you&#8217;ve documented them, you likely still struggle to find the pattern sometimes yourself, the one you know you&#8217;ve documented? Sometimes?</p><p><strong>Chris</strong>: I think this is the pattern that all programmers fall into: sometimes we&#8217;re really fantastic at organizing things, and sometimes we&#8217;re doing it on the run and it gets dropped into our temp folder somewhere, and we&#8217;ve got those 300 files sitting in that temp folder and who knows where they ended up.</p><p><strong>Shane</strong>: And in your consulting business, if you brought on five new consultants to work with you, there&#8217;s no way that you can actually efficiently share those patterns with them on day one. It&#8217;s gotta be an incremental thing, which is, oh, it looks like we&#8217;re struggling with that; here&#8217;s something that we&#8217;ve done before that seemed to fix it. You iterate it, you pick it up and change it, make it better and push it back. But I don&#8217;t see a lot of that, because it&#8217;s an overhead. And the downside of that for consultants as well is that it makes you more efficient: if you are paid by the hour, it actually costs you money to document it, and then you get paid less for implementing it. But if you go value based, or you become fractional, then those patterns are key. That ability to deploy something quicker and faster and easier, and still get paid for the value of it, is where we should be aiming for people that are experienced and have that kind of pattern repository.</p><p><strong>Chris</strong>: To that point, like I said, I try to organize things, and I do a lot of fractional work, right? A lot of my consulting is fractional, and so I have some old patterns that I can reuse. And to the point of bringing on a subcontractor or somebody, I could hand somebody, okay, this is my Azure folder, so here are my Azure patterns, or here are my Databricks patterns.</p>
<p>But then I have those stragglers, to your point, that didn&#8217;t make it into the folder, because they&#8217;re somewhere else on my external drive. And then I&#8217;m like, where did I put that? I know I&#8217;ve done it 50 times. Yeah.</p><p><strong>Shane</strong>: Yeah. And that&#8217;s why a lot of engineers just write from scratch, &#8217;cause it&#8217;s actually easier to write the pattern from their head, the way they&#8217;ve done it 10 times before, than it is trying to search and find it. And that&#8217;s when it&#8217;s your code, when it&#8217;s your pattern. Again, when you&#8217;re trying to share somebody else&#8217;s, that knowledge of the pattern and the context and the anti-patterns is really hard. So I think one of the things, if you&#8217;re an organization and you&#8217;re looking to hire somebody and that person is saying they&#8217;re experienced, you should actually just ask them to describe three patterns. If they can&#8217;t do that, then they&#8217;re not as experienced as they said. Alright, just on Fabric, you reckon it&#8217;s there now. Would you recommend to a customer, if they&#8217;re a Microsoft shop, that they should include Fabric along with Databricks and Snowflake to make the triad they investigate, or not?</p><p><strong>Chris</strong>: I think it&#8217;s finally gotten to the point where it could be. I think it highly depends on what your patterns are, to the point of our whole conversation here. It definitely depends on what patterns you&#8217;re actively using in your business. And I mentioned earlier that Microsoft is really pushing people into Fabric. They discontinued the big Power BI license that allowed people to have this Power BI environment, and they&#8217;re pushing people into Fabric; you have to do the Fabric option now. For data engineers, they discontinued the DP-203, the Azure Data Engineer certificate, and you have to go to the DP-700, which is the Microsoft Fabric Data Engineer certificate. So Microsoft has come through and taken the stance of really pushing people in that direction. On the plus side, they&#8217;ve also made it so much more robust than it was when you and I probably initially looked at it a few years back, and so there&#8217;s definitely some value to having everything all in one place, especially if you&#8217;re already using ADF anyway; then you could migrate a lot of that stuff over. It&#8217;s not plug and play, which would be my feedback back to Microsoft: allow people to just migrate their whole ADF environment into Fabric. That would be the optimal solution there, from my perspective.</p><p><strong>Shane</strong>: I think, again, if we look at patterns, there is a pattern where consulting companies make a large amount of money migrating from one to the other. And that helps the software company, because having partners recommend your product is one of the channels. And I think that those core architectural patterns are important. So my understanding is, if you&#8217;re gonna deploy Databricks, you would deploy Unity Catalog. Yeah, it&#8217;s core. It&#8217;s not optional anymore, it&#8217;s just a core component. And then if you&#8217;re gonna deploy Snowflake, I would suggest you start planning how you can decommission all the third-party products.</p>
<p>Because the pattern I see them doing as an organization is starting to become end-to-end themselves. They&#8217;re bringing it in, just like Salesforce buying Snowflake, you&#8217;re bringing it in. So if you rely on a third-party component, you&#8217;ll start seeing pressure to start using the Snowflake equivalent over time, because their lifecycle as a company, that&#8217;s where they&#8217;re going. That&#8217;s a pattern we&#8217;ve seen before, and we shouldn&#8217;t be surprised when we see it again.</p><p><strong>Chris</strong>: I know of a company in Finland, that&#8217;s the direction they&#8217;ve picked up. One of my mentees is a senior data engineer for a company out there, and they have Snowflake and they&#8217;re doing everything within Snowflake. They&#8217;re writing Python scripts, they&#8217;re orchestrating those Python scripts within Snowflake, scheduling them in Snowflake, and trying to do everything within there. But yeah, absolutely.</p><p><strong>Shane</strong>: If you&#8217;re buying dbt or using dbt, you shouldn&#8217;t be surprised when more and more features go behind the paywall, because that&#8217;s the life stage, that&#8217;s the pattern of where that company&#8217;s at. Excellent. Alright, so if people wanna get a hold of you, where do they find you? Do they find any writing? How can they get in touch and see what you&#8217;re doing?</p><p><strong>Chris</strong>: Absolutely. So I have a YouTube channel, it&#8217;s a data engineering channel, so you could find me there, as well as my website, which is Gambill Data Engineering. My name is spelled a little bit weird, so it&#8217;s G-A-M-B-I-L-L, gambilldataengineering.com, or just gambledata.com. Both redirect to the same site, and those are the big places. And then obviously on LinkedIn you can find me, Christopher Gambill. There&#8217;s a lot of Christopher Gambills out there, so you&#8217;ll see the one that&#8217;s connected to Gambill Data.</p><p><strong>Shane</strong>: I think that&#8217;s the other pattern. If you want to try and get into the space and make a name for yourself, change your name. &#8217;Cause, uh, Shane Gibson the guitarist, he&#8217;s more famous than me. And then Shane Gibson the sales trainer, who&#8217;s way more famous than me. And it&#8217;s, damn, I need to change my name. But there was a guy in New Zealand who changed his name to Mark Rocket, &#8217;cause he&#8217;s just been in, what&#8217;s the rocket that looks like a penis? The Amazon guy&#8217;s. Bezos&#8217; rocket. Yeah. He changed his last name to Rocket. So there you go. That&#8217;s dedication. That&#8217;s awesome. Excellent. Hey look, thank you for coming on the show and describing some of those patterns. That&#8217;s been pretty cool.</p><p><strong>Chris</strong>: Absolutely. Thank you. 
Thank you very much for having me.</p><p><strong>Shane</strong>: I hope everybody has a simply magical day.</p><h2>&#171;oo&#187;</h2><div class="pullquote"><p><em>Stakeholder - &#8220;Thats not what I wanted!&#8221; <br>Data Team - &#8220;But thats what you asked for!&#8221;</em></p></div><p>Struggling to gather data requirements and constantly hearing the conversation above?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Bu2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" width="387" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:387,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19725,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/160520537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Want to learn how to capture data and information requirements in a repeatable way so stakeholders love them and data teams can build from them, by using the Information Product Canvas.</p><p>Have I got the book for you!</p><p>Start your journey to a new Agile Data Way of Working.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://adiwow.com/168&quot;,&quot;text&quot;:&quot;Buy the Agile Data Guide now!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://adiwow.com/168"><span>Buy the Agile Data Guide now!</span></a></p><h2>&#171;oo&#187;</h2>]]></content:encoded></item><item><title><![CDATA[The pattern of writing a data book with Shane Gibson and Ramona C Truta]]></title><description><![CDATA[AgileData Podcast #64]]></description><link>https://agiledata.info/p/the-pattern-of-writing-a-data-book</link><guid isPermaLink="false">https://agiledata.info/p/the-pattern-of-writing-a-data-book</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Thu, 05 Jun 2025 00:45:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ac61a9d4-9226-4ac9-96a6-f7791bd32755_1811x1013.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Ramona C Truta as she interviews Shane Gibson about the patterns of writing a data book (and they discuss a whole lot of other data things)</p><blockquote><h2>Listen</h2></blockquote><p>Listen on all good podcast hosts or over at:</p><p><a href="https://podcast.agiledata.io/e/the-pattern-of-writing-a-data-book-with-shane-gibson-and-ramona-c-truta-episode-64/">https://podcast.agiledata.io/e/the-pattern-of-writing-a-data-book-with-shane-gibson-and-ramona-c-truta-episode-64/</a></p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://podcast.agiledata.io/e/the-pattern-of-writing-a-data-book-with-shane-gibson-and-ramona-c-truta-episode-64/&quot;,&quot;text&quot;:&quot;Listen to AgileData Podcast Episode&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://podcast.agiledata.io/e/the-pattern-of-writing-a-data-book-with-shane-gibson-and-ramona-c-truta-episode-64/"><span>Listen to AgileData Podcast Episode</span></a></p><div id="youtube2-9VBus8KtyfI" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;9VBus8KtyfI&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/9VBus8KtyfI?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p></p><blockquote><h2>Read</h2></blockquote><p>Read or download the podcast transcript at:<br><br><a href="https://agiledata.io/podcast/agiledata-podcast/the-pattern-of-writing-a-data-book-with-shane-gibson-and-ramona-c-truta/#read">https://agiledata.io/podcast/agiledata-podcast/the-pattern-of-writing-a-data-book-with-shane-gibson-and-ramona-c-truta/#read</a></p><p></p><blockquote><h2>Google NoteBookLM Briefing</h2></blockquote><h2>Briefing Document: "Scratching the Book Itch: A Patterned Journey" - AgileData Podcast</h2><p><strong>Source:</strong> Excerpts from "Scratching the Book Itch: A Patterned Journey" - AgileData Podcast featuring Shane Gibson and Ramona C Truta.</p><p><strong>Subject:</strong> Discussion on writing a book, the concept of "patterns," the role of data professionals, and the future of learning and data in the age of AI.</p><p><strong>Key Participants:</strong></p><ul><li><p><strong>Shane Gibson:</strong> Podcast Host (acting as guest in this episode), Author of "An Agile Data Guide to  Information Product Canvas." Agile Data Coach.</p></li><li><p><strong>Ramona C Truta:</strong> Podcast Co-Host (acting as host in this episode). Data professional with a strong background in mathematics, computer science, databases, academia, and research. Describes herself as a "facilitator of knowledge."</p></li></ul><p><strong>Executive Summary:</strong></p><p>This podcast episode features a conversation between Shane Gibson and Ramona C Truta, where Shane discusses his journey writing his book on data patterns. The discussion delves into the concept of "patterns" as common solutions to common problems, drawing parallels to architectural design. Shane shares the challenges and lessons learned during his writing process, including overcoming anti-patterns like rigid outlines and lack of consistency. A central theme is the evolving role of data professionals as "translators" and "storytellers" in the age of data and AI, moving beyond purely technical expertise. The conversation also explores the unexpected pattern of data teams replacing manual Excel processes and questions the traditional justification for data team value. 
Finally, the hosts ponder the implications of AI on learning, critical thinking, and the future of education, advocating for a focus on skills and storytelling over rigid roles and knowledge acquisition.</p><p><strong>Main Themes and Important Ideas/Facts:</strong></p><ol><li><p><strong>The Concept of Patterns:</strong></p></li></ol><ul><li><p>Drawing inspiration from Christopher Alexander's "A Pattern Language" (about building houses), a pattern is defined as a <strong>"common problem"</strong> with a <strong>"common solution that fixes that problem most of the time."</strong></p></li><li><p>Patterns provide <strong>"frameworks and patterns that I can anchor to"</strong> to explain complex concepts simply.</p></li><li><p>The idea of <strong>"antipas"</strong> is introduced &#8211; solutions that create more waste than they solve.</p></li><li><p>Ramona initially found Shane's focus on patterns perplexing but came to understand them as a way of structuring knowledge.</p></li><li><p>The authors believe that experience (accumulating "massive data sets" over "several decades") allows for the building of <strong>"knowledge graphs"</strong> and applying "system thinking" to identify and apply patterns.</p></li></ul><ol><li><p><strong>The Challenges and Process of Writing a Book:</strong></p></li></ol><ul><li><p>Shane describes writing a book as <strong>"scratching a itch,"</strong> something he felt compelled to do.</p></li><li><p>It is <strong>"incredibly hard"</strong> and <strong>"lots of people start and never finish."</strong></p></li><li><p>Shane learned that <strong>"you don't make money out of a book."</strong></p></li><li><p>Initial attempts failed due to anti-patterns:</p></li><li><p><strong>Rigid outlines/table of contents:</strong> Shane kept <strong>"rewriting the table of contents"</strong> instead of writing content.</p></li><li><p><strong>Inconsistent writing:</strong> Writing only <strong>"when I felt like it"</strong> was not effective.</p></li><li><p>Shane developed a successful process involving <strong>"forcing functions,"</strong> specifically making his word count public on LinkedIn and aiming for <strong>"60,000 words in six months."</strong></p></li><li><p>This process forced him to <strong>"research"</strong> and led to better content by <strong>"enhancing your understanding of what you know."</strong></p></li><li><p>He found that <strong>"collaborating on a book is actually harder than writing it by yourself,"</strong> due to differing styles and visions.</p></li><li><p>Shane realised the importance of <strong>"what am I not gonna write,"</strong> applying a "constraints based model" to keep the book focused.</p></li><li><p>Running a related course helped refine content and identify areas needing clarification based on student questions. Shane's future pattern for writing is to <strong>"write the course first. Then write the book."</strong></p></li><li><p>Writing the book also revealed <strong>"new problems to solve,"</strong> such as learning about publishing details (e.g., "Adobe InDesign bleeds and Amazon Kindle Direct Publishing bleed formats").</p></li></ul><ol><li><p><strong>The Evolving Role of Data Professionals:</strong></p></li></ol><ul><li><p>Ramona describes her role as a <strong>"facilitator of knowledge,"</strong> translating <strong>"the tech talk... into business talk and vice versa."</strong> This involves removing <strong>"the fluff... 
and the foam, and I speak caffeine."</strong></p></li><li><p>Shane agrees that being a <strong>"translator"</strong> is a valuable role, especially with the advent of AI.</p></li><li><p>Both hosts highlight the need for <strong>"system thinking"</strong> and applying accumulated knowledge and frameworks.</p></li><li><p>The ability to <strong>"take these technical things we do and explain them by just telling stories"</strong> is crucial. Shane references the analogy of using a decision tree to explain a complex neural net model.</p></li><li><p>The authors reflect on how people often use jargon to describe the same concept, hindering understanding.</p></li><li><p>The hosts believe that <strong>"curators of that knowledge, humans that actually tell a story. Based on that knowledge in a different way will actually become the more successful people."</strong> This returns to the ancient pattern of "storytellers being the most valuable part of the tribe."</p></li></ul><ol><li><p><strong>Unexpected Patterns in Data Value and Teams:</strong></p></li></ol><ul><li><p>An unexpected pattern emerged from the collaborators on Shane's book: the "unlike" scenario (what would happen if the information product wasn't built) consistently resulted in <strong>"manual process using Excel."</strong></p></li><li><p>This led Shane to question if data teams are <strong>"overthinking it"</strong> in justifying their value, as they are often simply replacing inefficient manual processes.</p></li><li><p>Shane provocatively asks if data teams are essentially a <strong>"cost center"</strong> and questions the traditional model of data teams as a centralised <strong>"shared service"</strong> that has to <strong>"go and justify our existence."</strong></p></li><li><p>He draws a parallel to finance teams, which are also a shared service but are mandated and don't need to justify their existence.</p></li><li><p>He considers whether the future might involve decentralising data professionals, embedding them with operational teams.</p></li></ul><ol><li><p><strong>AI, Learning, and the Future of Education:</strong></p></li></ol><ul><li><p>The advent of AI provides <strong>"access to a knowledge base. 
In a way we've never had before."</strong></p></li><li><p>Ramona is enthusiastic about <strong>"learning how to learn"</strong> with AI tools, using chatbots to learn and test her knowledge.</p></li><li><p>Shane questions the assumption that AI requires more structured data, hypothesising that <strong>"we will structure our data less"</strong> and let LLMs find patterns in chaos, similar to k-means clustering versus complex SQL queries.</p></li><li><p>They discuss the <strong>"non-deterministic nature"</strong> of AI and the paradox that <strong>"we don't trust it"</strong> yet trust humans who are arguably more non-deterministic.</p></li><li><p>The hosts express concern about how future generations will develop critical thinking if they rely solely on AI for answers.</p></li><li><p>They believe that traditional educational institutions, focused on <strong>"giving you some knowledge and testing whether you got the knowledge,"</strong> are using <strong>"legacy education sessions."</strong></p></li><li><p>The future of education should focus on teaching <strong>"skills, not roles,"</strong> allowing individuals to combine skills ("storytelling skills and some science system thinking skills") to fill various roles and solve complex problems.</p></li><li><p>The discussion questions the need for traditional grading if the focus shifts to problem-solving and storytelling.</p></li></ul><p><strong>Key Quotes:</strong></p><ul><li><p>Shane: "Today I'm gonna be the guest and you are gonna be the host."</p></li><li><p>Ramona: "I have started in data since ever I joke that I came on this earth doing data..."</p></li><li><p>Shane: "...that idea of being a translator, I think we see. People fall into that role over time... because we've spent years observing and learning and training our brains to recognize patterns..."</p></li><li><p>Ramona: "...what I am now is a facilitator of knowledge... I take the tech talk... and translate it into business talk and vice versa."</p></li><li><p>Shane: "...the foam on the top of the cappuccino... I'm the person that goes down to the caffeine."</p></li><li><p>Shane: "...a patent really is there's a common problem. There's a common solution that fixes that problem most of the time..."</p></li><li><p>Shane: "...I used to get really frustrated that whenever we went into a new customer, we seem to be reinventing the wheel."</p></li><li><p>Shane: "...I decided that what I needed to do was a forcing function that would make me write consistently..."</p></li><li><p>Shane: "...it forced me to research... I could see the patterns that were anchoring me..."</p></li><li><p>Ramona: "...even in the first iteration... you are trying to figure out a pattern in the knowledge... It's meta the way I see it..."</p></li><li><p>Shane: "...collaborating on a book is actually harder than writing it by yourself?"</p></li><li><p>Shane: "...it's the book I wanted to write and like I said, I wrote it to scratch a itch."</p></li><li><p>Ramona: "...that is what makes. Sense to me, but I understand, and I know there is order and structure in chaos. You just have to find it."</p></li><li><p>Shane: "...because we know the machine is not deterministic... We don't trust it... However, when we ask a human... We trust that they know what to do... So there's this economy between not trusting a machine but trusting a human."</p></li><li><p>Shane: "...the technical tools that a data scientist use is Excel and Tableau, like bollocks..."</p></li><li><p>Ramona: "...I became so passionate... 
realizing how people just touch the surface of things and don't know what's underneath..."</p></li><li><p>Shane: "...writing a book for people that were inquisitive... It tells you what you need to know, but actually you still have to go and do more work."</p></li><li><p>Shane: "...what am I not gonna write? That was more important."</p></li><li><p>Shane: "...the answer was manual process using Excel."</p></li><li><p>Shane: "...maybe we're overthinking it because really the alternative is do you wanna run that process in Excel or do you not? Because that's actually the thing we're replacing nine times outta 10."</p></li><li><p>Shane: "...data teams are like a finance team. Without the mandate..."</p></li><li><p>Shane: "...the questions that I get asked is the content the book should focus on."</p></li><li><p>Shane: "...maybe we're looking at it from different angles... Structuring or cleaning the data. Those are D different aspects of it."</p></li><li><p>Shane: "...maybe the answer is actually we do get embedded with the operational teams. We no longer become a centralized shared service. We become decentralized."</p></li><li><p>Shane: "I have a hypothesis... that actually will structure our data less..."</p></li><li><p>Shane: "...curators of that knowledge, humans that actually tell a story. Based on that knowledge in a different way will actually become the more successful people..."</p></li><li><p>Shane: "...we are trying to apply legacy patterns that haven't worked for us anyway into this new world."</p></li><li><p>Shane: "...the education system has to teach skills, not roles."</p></li><li><p>Shane: "If you're gonna go and write a book, do it for yourself."</p></li></ul><p><strong>Conclusion:</strong></p><p>This podcast provides a rich and insightful discussion on the multifaceted journey of writing a book, particularly in the context of data and technology. It highlights the personal and professional growth that comes with the process, the importance of identifying and applying patterns (both positive and negative), and the evolving landscape of the data profession in the age of AI. The conversation provocatively questions established norms around data team value and the future of education, advocating for adaptability, storytelling, and a focus on core skills. 
The discussion is marked by the candidness of the hosts and their willingness to share both successes and failures.</p><h2>&#171;oo&#187;</h2><div class="pullquote"><p><em>Stakeholder - &#8220;Thats not what I wanted!&#8221; <br>Data Team - &#8220;But thats what you asked for!&#8221;</em></p></div><p>Struggling to gather data requirements and constantly hearing the conversation above?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Bu2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" width="387" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:387,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19725,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/160520537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Want to learn how to capture data and information requirements in a repeatable way so stakeholders love them and data teams can build from them, by using the Information Product Canvas.</p><p>Have I got the book for you!</p><p>Start your journey to a new Agile Data Way of Working.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://adiwow.com/168&quot;,&quot;text&quot;:&quot;Buy the Agile Data Guide now!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://adiwow.com/168"><span>Buy the Agile Data Guide now!</span></a></p><h2>&#171;oo&#187;</h2>]]></content:encoded></item><item><title><![CDATA[The Hook data modeling pattern with Andrew Foad]]></title><description><![CDATA[AgileData Podcast #63]]></description><link>https://agiledata.info/p/the-hook-data-modeling-pattern-with</link><guid isPermaLink="false">https://agiledata.info/p/the-hook-data-modeling-pattern-with</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Sun, 18 May 2025 19:18:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/579502f2-a10f-426d-85c8-ffb43abd7b1b_800x800.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Andrew Foad on his data modeling pattern "Hook"</p><p></p><blockquote><h2>Listen</h2></blockquote><p>Listen on all good podcast hosts or over at:</p><p><a href="https://podcast.agiledata.io/e/the-hook-data-modeling-pattern-with-andrew-foad-episode-63/">https://podcast.agiledata.io/e/the-hook-data-modeling-pattern-with-andrew-foad-episode-63/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://podcast.agiledata.io/e/the-hook-data-modeling-pattern-with-andrew-foad-episode-63/&quot;,&quot;text&quot;:&quot;Listen to AgileData Podcast 
Episode&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://podcast.agiledata.io/e/the-hook-data-modeling-pattern-with-andrew-foad-episode-63/"><span>Listen to AgileData Podcast Episode</span></a></p><div id="youtube2-0_wxH8f4vNY" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;0_wxH8f4vNY&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/0_wxH8f4vNY?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p></p><blockquote><h2>Read</h2></blockquote><p>Read the podcast transcript at:<br><br><a href="https://agiledata.io/podcast/agiledata-podcast/the-hook-data-modeling-pattern-with-andrew-foad/#read">https://agiledata.io/podcast/agiledata-podcast/the-hook-data-modeling-pattern-with-andrew-foad/#read</a></p><p></p><blockquote><h2>Google NoteBookLLM Briefing</h2></blockquote><p><strong>Briefing Document: The Hook Data Modeling Pattern</strong></p><p><strong>Source:</strong> Excerpts from "The Hook data modeling pattern with Andrew Foad - Episode #63" of the Agile Data podcast.</p><p><strong>Date:</strong> 5 May 2025 (Approximate date of recording based on podcast reference)</p><p><strong>Subject:</strong> Review of the Hook data modeling pattern, its origins, core concepts, and benefits, as discussed by Andrew Foad (creator of Hook) and Shane Gibson (podcast host).</p><p><strong>Executive Summary:</strong></p><p>The Hook data modeling pattern is presented as a simpler, more agile approach to data warehousing, developed by Andrew Foad to address perceived problems with traditional data modeling techniques, particularly Data Vault. Hook focuses on organising raw data by aligning it to business concepts using "hooks" (formalised business keys) rather than transforming the data structure upfront. This approach aims to reduce bottlenecks, increase agility, and make data more accessible for self-service analytics with "just enough modeling" (JEM) in the consumption layer.</p><p><strong>Key Themes and Ideas:</strong></p><p><strong>Origins and Motivation:</strong> Hook was developed by Andrew Foad out of frustration with the complexity and bottlenecks encountered on Data Vault projects. A key moment was an engineer expressing a preference for a simple satellite table over a fully modelled Data Vault structure, enabling quicker report building. This highlighted the desire to decouple modeling from data delivery and prioritize getting data into a usable format faster.</p><ul><li><p>"Hook is my attempt to solve problems that I encountered with data vault."</p></li><li><p>"There has to be a better way... why do we have to do the modeling up front... is there a way that we could perhaps not have to do that?"</p></li><li><p>An engineer's response to receiving a satellite table without a full Vault structure: "'Yes give me that because if you give me that I can go and build the reports off the back of that.' 
He didn't care about the vaulty stuff It was a hindrance It was in the way He wanted to get to the right hand side as quickly as he possibly could."</p></li></ul><p><strong>Core Concepts of Hook:</strong></p><ul><li><p><strong>Not Traditional Modeling:</strong> Foad argues that Hook isn't traditional modeling because it doesn't change the structure of the incoming data. It focuses on adding information and aligning data to business concepts.</p></li><li><p>"I don't like to say modeling because I don't think it really is modeling for me... With hook you don't do that Whatever comes in is what comes out the other end All you're doing is you're adding some additional information to it."</p></li><li><p><strong>Aligning to Business Concepts:</strong> The core function is to align raw data objects to predefined business concepts (e.g., Customer, Order). This is similar to the concepts behind Data Vault hubs, but without the physical restructuring.</p></li><li><p>"Basically you're aligning those objects to business concepts... but you're not changing the underlying data You're basically tagging those assets aligning those and formalizing the business keys."</p></li><li><p><strong>The 'Frame' Object:</strong> Hook primarily uses a single object type called a 'frame'. A frame is a wrapper around the raw or landed table. The data within the frame is not transformed, only augmented with "hooks".</p></li><li><p>"We only have really one object type and it's called a frame... Basically you take your raw or your landed table and you wrap it You frame it... The data isn't transformed It's just a wrapper that you put around it."</p></li><li><p><strong>'Hooks' (Formalized Business Keys):</strong> Hooks are formalized business keys that are added as additional columns to the frame. These keys align the frame to business concepts and enable joining across different frames.</p></li><li><p>"The additional things that you add to that are formalized business keys which align to those business concepts... So basically you've got a big bus matrix then concepts and assets or frames."</p></li><li><p>Example: An HK customer field in both customer and orders tables allows them to be joined.</p></li><li><p><strong>'Key Sets':</strong> Key Sets are prefixes applied to formalised business keys within a hook. They provide context about the origin or type of the business key, particularly useful when dealing with multiple source systems or different keys for the same concept (e.g., Customer ID vs. Customer Code).</p></li><li><p>"It's basically a predefined sequence of characters which tells you where that key came from... It's a bit more than that It's not just distinguishing between systems It's given us something that we can use to basically give a bit more context around that business key."</p></li><li><p>Key sets are defined in a reference table or metadata store.</p></li><li><p><strong>Business Glossary First:</strong> A core rule in Hook is that a hook (formalised business key) cannot be created unless the corresponding business concept has a definition in the business glossary. This prioritises business understanding before applying it to data.</p></li><li><p>"One of the hard rules in hook is that it has to be in the glossery and you really should have a definition for it as well."</p></li></ul><p><strong>Agility and Implementation:</strong></p><ul><li><p>Hook can be implemented using views or by adding physical columns, making it very agile. 
Changes like adding, dropping, or renaming keys don't require refactoring the core structure.</p></li><li><p>"You could implement that as say a view... or you could have a physical table You just add a column calculate it drop a column if you want to get rid of it So it's really agile in that respect You don't have to refactor the model at any point."</p></li><li><p>The approach is lightweight, focusing on organization rather than complex transformation upfront.</p></li><li><p>"It's like Data Vault I guess but you've just collapsed all those business keys into the satellite That's really all it is."</p></li></ul><p><strong>Layered Architecture Context:</strong></p><ul><li><p>Hook fits best in the "designed" or "silver" layer of a multi-layered data architecture (e.g., Raw, Designed, Consume or Bronze, Silver, Gold).</p></li><li><p><strong>Raw Layer:</strong> Ingestion of source data as is.</p></li><li><p><strong>Designed Layer (Hook):</strong> Applying the Hook pattern to the raw data. The physical structure remains source-aligned, but hooks provide logical alignment to business concepts. This is the "organize" step in the ELO (Extract, Load, Organize) pattern.</p></li><li><p>"Hook is about the data warehouse component... it's the Inmon criteria the subject oriented integrated time variant and then immutable it's that bit after that you've got a consumption bit that's when you have to do some modeling but the idea is because you've organized things in the hook structure what we found is that you don't need to do too much modeling."</p></li><li><p>"The designed layer what we used to call the EDW that is where hook comes into its own."</p></li><li><p><strong>Consume Layer:</strong> Specific modeling (the "small T" in ELO) is done here for particular use cases that cannot be easily met directly from the Hook layer. This is "just enough modeling" (JEM).</p></li><li><p>"So it's modeling light because you've organized You don't need to do too much modeling That's the idea."</p></li><li><p>This layer can produce different output formats (dimensional, flat wide tables, activity schemas) depending on the consumption tool's needs.</p></li><li><p>Complex logic, aggregations, measures, and metrics are typically handled in this layer.</p></li><li><p>Data can sometimes be consumed directly from the Hook (Designed) layer if the use case is simple and users are familiar with the source structure and the added hook context.</p></li></ul><p><strong>Handling Complexity:</strong></p><ul><li><p><strong>Historical Data:</strong> While not explicitly part of the Hook pattern itself, historical data (tracking changes over time using effective dates, is-current flags, etc.) is handled in layers <em>after</em> the raw ingestion, typically in the Silver layer, separate from the core Hook organisation. This is a common problem across all modeling techniques.</p></li><li><p>"That isn't a hook thing but that's absolutely something that we've done... 
we applied that row effective from row effective to row is current row is deleted We added those fields on as well using those same techniques using the windowing functions."</p></li><li><p><strong>Master Data Management (MDM) / Deduplication:</strong> Inferring relationships between keys from different sources (e.g., mapping Customer 123 from System A to Customer ABC from System B) is seen as heavy lifting that happens outside the core Hook layer, likely producing another dataset that can then be ingested and have hooks applied.</p></li><li><p>"At the end of the day you've got to create an asset which does the mapping between one key to another key and that's just another asset."</p></li><li><p><strong>Derived/Inferred Data:</strong> Calculations, aggregations, measures, and metrics are pushed down to the consume layer or a separate processing layer before consumption.</p></li><li><p>"The measures and the metrics because we're inferring or calculating things They're going to be in that consume layer That's where we're going to push all that heavy lifting again."</p></li></ul><p><strong>Metadata and Automation:</strong></p><ul><li><p>Hook relies on a simple metadata model to define concepts, key sets, hooks, and frames.</p></li><li><p>This metadata can be used to automate the generation of Hook structures (views or tables) and downstream assets, reducing manual effort and ensuring consistency.</p></li><li><p>"The meta model itself is pretty straightforward There's only a few tables... you've got hook you've got key sets you've got concepts and frames... you just tag stuff and then there's little templating engine in there which says how do you want to spit this out."</p></li><li><p>Automation can generate SQL scripts or configuration files (like YAML for DBT).</p></li></ul><p><strong>Key Facts:</strong></p><ul><li><p>Hook was created by Andrew Foad.</p></li><li><p>It emerged from experience with Data Vault projects and the desire for a simpler approach.</p></li><li><p>It aligns data to business concepts using formalised business keys called "hooks".</p></li><li><p>Data structures are typically not transformed from the source; Hook wraps the raw data in a "frame".</p></li><li><p>"Key sets" are used to prefix and provide context for business keys.</p></li><li><p>Implementation can be physical (adding columns) or virtual (using views).</p></li><li><p>It promotes a "just enough modeling" (GEM) approach in the consumption layer.</p></li><li><p>It fits well within a layered data architecture, primarily operating in the "designed" or "silver" layer.</p></li><li><p>It relies on a simple metadata model for automation.</p></li></ul><p><strong>Quotes of Note:</strong></p><ul><li><p>"Hook is my attempt to solve problems that I encountered with data vault."</p></li><li><p>"Whatever comes in is what comes out the other end All you're doing is you're adding some additional information to it... Basically you're aligning those objects to business concepts."</p></li><li><p>"We only have really one object type and it's called a frame... The data isn't transformed It's just a wrapper that you put around it."</p></li><li><p>"The additional things that you add to that are formalized business keys which align to those business concepts... 
So basically you've got a big bus matrix then concepts and assets or frames."</p></li><li><p>"One of the hard rules in hook is that it has to be in the glossery and you really should have a definition for it as well."</p></li><li><p>"So it's modeling light because you've organized You don't need to do too much modeling That's the idea."</p></li><li><p>"The designed layer what we used to call the EDW that is where hook comes into its own."</p></li><li><p>"It's not ELT it's ELO Extract load and organize You're not restructuring There's no T involved."</p></li><li><p>"We only model by exception We don't model absolutely everything."</p></li></ul><p><strong>Areas of Discussion/Further Exploration:</strong></p><ul><li><p>The practical limitations and "horrible use cases" where Hook might not be the best fit (though Foad suggests its scope is limited to the warehouse organisation layer, pushing complexity elsewhere).</p></li><li><p>Specific details on the metadata model and the capabilities of automation tools like "Hook Studio".</p></li><li><p>Detailed examples of how Key Sets handle various complexities.</p></li><li><p>Comparisons to other lightweight data organisation patterns.</p></li><li><p>Real-world case studies beyond the initial project mentioned.</p></li></ul><p>This briefing provides a foundational understanding of the Hook data modeling pattern as described in the podcast episode, highlighting its core principles, benefits, and architectural fit.</p><h2>&#171;oo&#187;</h2><div class="pullquote"><p><em>Stakeholder - &#8220;Thats not what I wanted!&#8221; <br>Data Team - &#8220;But thats what you asked for!&#8221;</em></p></div><p>Struggling to gather data requirements and constantly hearing the conversation above?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Bu2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" width="387" height="342" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:387,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19725,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/160520537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Want to learn how to capture data and information requirements in a repeatable way so stakeholders love them and data teams can build from them, by using the Information Product Canvas.</p><p>Have I got the book for you!</p><p>Start your journey to a new Agile Data Way of Working.</p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://adiwow.com/168&quot;,&quot;text&quot;:&quot;Buy the Agile Data Guide now!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://adiwow.com/168"><span>Buy the Agile Data Guide now!</span></a></p><h2>&#171;oo&#187;</h2>]]></content:encoded></item><item><title><![CDATA[Increasing the perceived value of your data team with Aaron Wilkerson]]></title><description><![CDATA[AgileData Podcast #62]]></description><link>https://agiledata.info/p/increasing-the-perceived-value-of</link><guid isPermaLink="false">https://agiledata.info/p/increasing-the-perceived-value-of</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Thu, 10 Apr 2025 20:53:31 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9734f8ee-3433-4090-970b-e832ae5282ad_800x800.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Aaron Wilkerson on ways data teams can increase the perception of the value they add to their organisation.</p><p></p><blockquote><h2>Listen</h2></blockquote><p>Listen on all good podcast hosts or over at:</p><p><a href="https://podcast.agiledata.io/e/increasing-the-perceived-value-of-your-data-team-with-aaron-wilkerson-episode-62/">https://podcast.agiledata.io/e/increasing-the-perceived-value-of-your-data-team-with-aaron-wilkerson-episode-62/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://podcast.agiledata.io/e/increasing-the-perceived-value-of-your-data-team-with-aaron-wilkerson-episode-62/&quot;,&quot;text&quot;:&quot;Listen to AgileData Podcast Episode&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://podcast.agiledata.io/e/increasing-the-perceived-value-of-your-data-team-with-aaron-wilkerson-episode-62/"><span>Listen to AgileData Podcast Episode</span></a></p><div id="youtube2-NCA4_Zux9BA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;NCA4_Zux9BA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/NCA4_Zux9BA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p></p><blockquote><h2>Read</h2></blockquote><p>Read the podcast transcript at:<br><br><a href="https://agiledata.io/podcast/agiledata-podcast/increasing-the-perceived-value-of-your-data-team-with-aaron-wilkerson/#read">https://agiledata.io/podcast/agiledata-podcast/increasing-the-perceived-value-of-your-data-team-with-aaron-wilkerson/#read</a></p><p></p><blockquote><h2>Google NoteBookLLM Briefing</h2></blockquote><p><strong>Key Participants:</strong></p><ul><li><p><strong>Shane Gibson:</strong> Host, Agile Data Podcast</p></li><li><p><strong>Aaron Wilkerson:</strong> Director of Data Strategy and Products at Carhartt</p></li></ul><p><strong>Main Themes:</strong></p><p><strong>The Perception of Lack of Value:</strong> The podcast kicks off by acknowledging a common sentiment that many data teams are not perceived as delivering significant value to their organisations. 
This is seen as a problem, potentially exacerbated by recent economic downturns where data teams were impacted.</p><ul><li><p><strong>Quote (Shane):</strong> "there still seems to be this perception that data teams aren't adding value. And I think a lot of it comes from when the downturn happened, a lot of data teams got hit. Like a lot of good people lost their jobs and that reinforced this message or this perception that data teams aren't adding value."</p></li></ul><p><strong>Understanding and Managing Expectations:</strong> The perceived lack of value is often tied to unclear or misaligned expectations. Legacy businesses that "fell into data" may have initially had low expectations that have since grown with increasing capabilities. Newer teams, conversely, may face very high expectations from the outset.</p><ul><li><p><strong>Quote (Aaron):</strong> "I think the reason I think it really depends on the organization. I've been a part of a lot of organizations... They fell into data. They didn't really know what they want data to do. So they started with usually it's like a database, right?... I would say expectations I would say started pretty small around hey we just need you guys to do these couple things but then as you start to build out reporting... I think that you started to see just more expectations coming out of the data team."</p></li><li><p><strong>Quote (Aaron):</strong> "I really think it depends on the leader and really what that mandate is. What were you brought in to do or what is your team being asked to do? But if you don't know then I really then that's also on the leader to define that and say hey this is what I think you guys want us to do. This is what I think is important. Am I right or am I wrong?"</p></li></ul><p><strong>The Need for Proactive Engagement and "Selling" Value:</strong> Data teams often operate in a reactive mode, waiting for requests. To increase perceived value, they need to be more proactive, understand the organisation's strategy, experiment with data to identify potential value, and then "sell" those possibilities internally.</p><ul><li><p><strong>Quote (Shane):</strong> "What I see with data teams is often they sit in a cupboard. They sit behind this wall of data work and they never really go out and showcase what they've done and so they don't sell it."</p></li><li><p><strong>Quote (Shane):</strong> "if they spent a bit of their time understanding the organization strategy like where's it going what is its goals and then effectively playing with data as data experts to say, hey, given that goal, if we use this data in this way, it may have value. And then going back and treating it as something they have to sell internally, I think it'd be really more fun for them..."</p></li><li><p><strong>Quote (Aaron):</strong> "a lot of us we came into this job because we wanted to be coders or programmers where you believe that hey I'm just showing up to write code and build something but it's not my job to sell people on what I'm doing to tell people."</p></li></ul><p><strong>Shifting from Ticket-Based Work to Value-Driven Outcomes:</strong> The traditional model of waiting for "Jira tickets" hinders the perception of value. Data teams need to focus on understanding the business problem and how data can solve it, adopting a more product-centric approach.</p><ul><li><p><strong>Quote (Aaron):</strong> "The process is we get a ticket that comes in, we write it and we give it to someone else and someone else's job is to sell what we do. 
And I think that's where it's come to our detriment."</p></li><li><p><strong>Quote (Shane):</strong> "look at product management right if you look at those types of roles we're starting to see that role of data product manager come into teams now and it's not sales and marketing it's basically understanding the business problem and then trying to figure out, ideate how you could solve it with data, discover which option is actually the best one and then go and implement it."</p></li></ul><p><strong>The Data Catalog Misconception:</strong> Data teams often believe a data catalog alone will showcase their value. However, business stakeholders don't typically engage with it. The focus should be on how data is surfaced and the value it provides in a user-friendly way.</p><ul><li><p><strong>Quote (Shane):</strong> "Our store is the data catalog. And I'm always intrigued when data teams believe that having a data catalog, pumping more data into the catalog and then that's the end of their job. Somehow everybody's going to come shopping in the catalog. Whereas actually data catalogs are only designed for data people. You go talk to a stakeholder, they don't want to use your data catalog. They don't care. There's a bunch of data in there. There's no value to them. They're not data people."</p></li></ul><p><strong>Understanding the Business Context and Processes:</strong> Data teams need a deep understanding of business processes to effectively leverage data for improvement, not just reporting. The increasing complexity and automation of business processes within SaaS and ERP systems mean that even business users may not fully understand the underlying data flows.</p><ul><li><p><strong>Quote (Aaron):</strong> "I think the challenge is it's the next level of the four steps of the analytics realm. I think the descriptive, diagnostic, predictive and prescriptive. So our value then becomes how do we not only just tell them the story we tell them why it's going on how can we do things better..."</p></li><li><p><strong>Quote (Shane):</strong> "if you work with a data team, we love to do data lineage. We love to do this idea of nodes and links to see the flow of the data through our system. And why do we do that? Because that map helps us visualize the work being done, understand what's happening where, understand where we can fix it, where it's broken."</p></li></ul><p><strong>Over-Engineering and Complexity:</strong> Data teams can sometimes over-engineer solutions and overcomplicate explanations, making it difficult for business users to understand the value. Simplifying language and focusing on business outcomes is crucial.</p><ul><li><p><strong>Quote (Aaron):</strong> "I think that's also one of our challenges around value is that we've over-tooled and over complicated a lot of the work that we do and it was very difficult for us to give the business simple answers to questions..."</p></li><li><p><strong>Quote (Shane):</strong> "We as data people, we love to argue semantics. What is a semantic layer? What is a data product? What is a data contract? We get intrigued by engineering the words and somehow we think exposing that to our stakeholders is a good idea. Argue amongst yourselves, but give them simplicity. Just give them a definition."</p></li></ul><p><strong>Accountability for Value:</strong> While data teams provide the data, the stakeholders requesting it are ultimately responsible for delivering business value. 
However, data teams need to understand and advocate for the potential value of their work and push back on requests that lack clear purpose.</p><ul><li><p><strong>Quote (Shane):</strong> "somehow we're held accountable for the value of that data, not the stakeholder who's trying to make that business change."</p></li><li><p><strong>Quote (Aaron):</strong> "I always think the answer the question right if my CEO came to me and said hey what are you guys working on I tell her and she said what's the value. I don't know. I'm giving it to someone else for them to figure out. It just doesn't come across as a good answer, right?"</p></li></ul><p><strong>Data Teams as Optional vs. Essential:</strong> Data teams are often perceived as "nice to have" rather than fundamentally essential to the core business operations, unlike functions like finance. Embedding data deeply into operational processes can help shift this perception.</p><ul><li><p><strong>Quote (Shane):</strong> "we are still perceived in most organizations as being optional."</p></li><li><p><strong>Quote (Aaron):</strong> "I still think that organizations are still trying to figure out what data teams are how to get value out of it because it's just so different organization... unless you really embed your data team into the operational nature of and say, 'Yeah, we actually can't run without the data team.' That's more like operation use cases. I think that's really the challenge that we're still proving our worth so to speak in many different organizations."</p></li></ul><p><strong>The Pitfalls of Constant Replatforming:</strong> Frequent technology changes without a clear link to business value can erode trust and the perception of value. Stakeholders don't necessarily see the benefit of repeated "kitchen renovations."</p><ul><li><p><strong>Quote (Shane):</strong> "Replatforming every couple years cuz we think it's cool doesn't do us any favors if you're sitting outside the data team and saying what value have they added to us this year."</p></li><li><p><strong>Quote (Aaron):</strong> "I've been guilty of that myself in my career, trying to argue for the replatform... I just need 12 to 18 months and $3 to $5 million to replatform it. But at the end of that, I promise you... I think that's where a lot of us have gotten where we do this 18-month transformation and most likely by the end of that your leader's leaving..."</p></li></ul><p><strong>Time to Value and Cycle Time:</strong> Stakeholders' perception of time starts from when they make a request, not when the data team begins work. Optimising the entire cycle time, from request to delivery, is crucial. Managing the intake queue and setting realistic expectations are also important.</p><ul><li><p><strong>Quote (Shane):</strong> "we often start the clock ticking from when we picked the work up. Oh yeah, it was great. Picked the work up like cra I smashed it a couple of days and it was in their hands yet it's been sitting in that queue for 3 months."</p></li><li><p><strong>Quote (Aaron):</strong> "They're like, 'That's great.' But I asked for this like 6 months ago. So, they don't see it's you're getting like back to out of debt."</p></li></ul><p><strong>The Importance of Roadmaps and Transparency:</strong> Communicating the data team's plans and progress through clear roadmaps helps manage expectations and demonstrate value delivery over time. 
Time horizons rather than fixed timelines are recommended due to inherent uncertainties.</p><ul><li><p><strong>Quote (Aaron):</strong> "road maps have been very big to me right now trying to figure out what's the best way to create and visualize road maps to show our business partners because I think to your point like they just we just have to be honest with about hey you're not going to get this for a year."</p></li><li><p><strong>Quote (Shane):</strong> "what they tend to do is they tend to do time horizons... So they tend to say, we'll have a a dot on the page and work that's closer to the dot is more likely to be done and then maybe there's another time horizon and then stuff out there will probably get done at some stage..."</p></li></ul><p><strong>Visualising Delivered Value:</strong> Creating visual representations of the value delivered, perhaps by "colouring in" domains on an organisational map as data capabilities are built, can help stakeholders understand the team's impact over time, especially given leadership turnover.</p><ul><li><p><strong>Quote (Shane):</strong> "What we can do is we can wireframe out what we think we're going to build and then color it in as we build it... and what we're doing is we're visually showing this map of the value we've added over the last couple years because people forget..."</p></li></ul><p><strong>Incremental Delivery and Building Trust:</strong> Providing value in smaller increments allows for feedback, builds trust, and ensures alignment with evolving business needs. Waiting for a large, year-long project to complete increases the risk of delivering something that is no longer needed or doesn't meet the current requirements.</p><ul><li><p><strong>Quote (Aaron):</strong> "you definitely want to show incrementally over time because you want to make sure that people feel comfortable that this are going to pay you all the different installments and you have a good experience at the end..."</p></li><li><p><strong>Quote (Shane):</strong> "by showing them something early, we increase their fluency as much as anything else and we get feedback and we get the ability to change before we've gone and build all the pipelines to feed that dashboard or whatever the way we deliver that product."</p></li></ul><p><strong>The Role of Data Storytelling and Data Product Managers:</strong> Effectively communicating insights and the value of data through storytelling is becoming increasingly important. 
The emergence of data product manager roles reflects the need for individuals who can bridge the gap between technical data work and business understanding.</p><ul><li><p><strong>Quote (Shane):</strong> "one of the key takeaways for me is this idea of a storyteller we starting to see data storytelling becoming a thing where instead of just giving them back a list of data, we're starting to tell them a story about that data uh in business context."</p></li><li><p><strong>Quote (Aaron):</strong> "that's where you're seeing a lot of the data product managers right you're hearing much more about that role coming out because you realize that that's the don't know it's the data storyteller plus uh storyteller plus your product manager who can tell you about it but they can also figure out if anybody's using it requirements I think that's where you're starting to see these roles get created..."</p></li></ul><p><strong>Sales as a Natural Part of the Role:</strong> Demonstrating value requires data professionals to embrace a degree of "sales," not in a pushy way, but through clear communication and highlighting the benefits of their work. This is essential for career longevity and ensuring the team's continued relevance.</p><ul><li><p><strong>Quote (Aaron):</strong> "I realized that sales is a natural part of our jobs and our life, right? If I lose my job, my new job is I have to go sell myself to a new company... So I think we have to realize that sales is a part of..."</p></li><li><p><strong>Quote (Shane):</strong> "Describing the value we've added to the organization we work in and keeping adding that value and therefore keeping our jobs is much easier and cheaper than losing our job and having to go into a recruitment round where we have to sell the value we could offer rather than the value we have delivered."</p></li></ul><p><strong>Iterating on Value Communication:</strong> Just as data work itself requires iteration, so too does the communication of value. Data teams should consciously experiment with different ways of articulating their impact and gather feedback on what resonates with stakeholders.</p><ul><li><p><strong>Quote (Shane):</strong> "I think I'm probably going to add One more layer into that now which is how are data teams experimenting and iterating with describing the value they've added that feedback loop to their stakeholders."</p></li><li><p><strong>Quote (Aaron):</strong> "That's also something I'm working with my team on is doing retros to say okay three months ago in the last quarter we worked on this thing and no one used it or everyone used it. So why did they or did they then not use it? 
And that should be an input into the future work..."</p></li></ul><p><strong>Most Important Ideas/Facts:</strong></p><ol><li><p>The perception that data teams lack value is a significant challenge that needs to be actively addressed.</p></li><li><p>Proactive engagement with business stakeholders, understanding their needs and the organisation's strategy, is crucial.</p></li><li><p>Data teams need to move beyond simply fulfilling requests and focus on delivering measurable business outcomes.</p></li><li><p>Clear communication, transparency through roadmaps, and visualising delivered value are essential for demonstrating impact.</p></li><li><p>Incremental delivery and continuous feedback loops help build trust and ensure alignment.</p></li><li><p>Embracing data storytelling and potentially dedicated roles like data product managers can enhance value communication.</p></li><li><p>"Selling" the value of data work is a necessary skill for data professionals to ensure their relevance and the team's success.</p></li><li><p>Iterating on how value is communicated and learning from past experiences is vital for continuous improvement.</p></li></ol><p><strong>Conclusion:</strong></p><p>This podcast provides valuable insights into the challenges surrounding the perceived value of data teams and offers a range of actionable strategies for improvement. By shifting from a reactive, task-oriented approach to a proactive, value-driven mindset, and by focusing on clear communication, collaboration, and a deep understanding of the business, data teams can significantly enhance their perceived value and become indispensable partners in achieving organisational goals. It's not just about the data itself, but about the story of impact that the data team can effectively tell.</p><h2>&#171;oo&#187;</h2><div class="pullquote"><p><em><strong>Stakeholder - &#8220;Thats not what I wanted!&#8221; <br>Data Team - &#8220;But thats what you asked for!&#8221;</strong></em></p></div><p>Struggling to gather data requirements and constantly hearing the conversation above?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Bu2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg" width="387" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:387,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19725,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/160520537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0Bu2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Bu2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea54a17-bf89-4dc3-a46b-d039a4585eee_387x342.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Want to learn how to capture data and information requirements in a repeatable way so stakeholders love them and data teams can build from 
them, by using the Information Product Canvas.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://adiwow.com/168&quot;,&quot;text&quot;:&quot;Buy the Agile Data Guide now!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://adiwow.com/168"><span>Buy the Agile Data Guide now!</span></a></p><h2>&#171;oo&#187;</h2>]]></content:encoded></item><item><title><![CDATA[ELM Patterns Templates with Remco Broekmans]]></title><description><![CDATA[AgileData Podcast #61]]></description><link>https://agiledata.info/p/elm-patterns-templates-with-remco</link><guid isPermaLink="false">https://agiledata.info/p/elm-patterns-templates-with-remco</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Tue, 04 Mar 2025 18:24:07 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/869fba2f-23c8-422c-a9d9-e928505d23f9_800x800.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Remco Broekmans about the pattern templates that are part of the Ensemble Logical Modeling (ELM) patterns.</p><p></p><blockquote><h2>ELM</h2></blockquote><p><a href="https://www.elmstandards.com/">https://www.elmstandards.com/</a></p><p><a href="https://www.amazon.com/Stories-Solutions-Redefining-Communication-Professionals-ebook/dp/B0DXQ8RR3X/">From Stories to Solutions: Redefining Communication Between Business and Data Professional</a></p><p></p><blockquote><h2>Listen</h2></blockquote><p>Listen on all good podcast hosts or over at:</p><p><a href="https://agiledata.io/podcast/agiledata-podcast/elm-patterns-templates-with-remco-broekmans/">https://agiledata.io/podcast/agiledata-podcast/elm-patterns-templates-with-remco-broekmans/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledata.io/podcast/agiledata-podcast/elm-patterns-templates-with-remco-broekmans/&quot;,&quot;text&quot;:&quot;Listen to AgileData Podcast Episode&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://agiledata.io/podcast/agiledata-podcast/elm-patterns-templates-with-remco-broekmans/"><span>Listen to AgileData Podcast Episode</span></a></p><div id="youtube2-SZm0b7-VsHY" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;SZm0b7-VsHY&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/SZm0b7-VsHY?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p></p><blockquote><h2>Read</h2></blockquote><p>Read the podcast transcript at:<br><br><a href="https://agiledata.io/podcast/agiledata-podcast/elm-patterns-templates-with-remco-broekmans/#read">https://agiledata.io/podcast/agiledata-podcast/elm-patterns-templates-with-remco-broekmans/#read</a></p><p></p><blockquote><h2>Google NoteBookLLM Briefing</h2></blockquote><p><strong>Briefing Document: Agile Data Podcast - ELM, Model Storming, and Data Modeling</strong></p><p><strong>Executive Summary:</strong></p><p>This podcast delves into the world of data modelling using agile principles, particularly the ELM (which isn't spelled out but I think is Ensemble Logical Modelling) approach. 
The conversation centres on "model storming," a collaborative workshop method for eliciting business requirements and building shared understanding <em>without</em> drowning stakeholders in technical jargon. Remco, the guest expert, emphasizes getting the <em>right</em> business people in the room, using their <em>own</em> terminology, and employing a structured, yet flexible, process facilitated by six key artifacts. The discussion also touches on the potential of Large Language Models (LLMs) to <em>assist</em> in the data modelling process, but warns against <em>automating</em> it entirely due to the inherent complexities of business needs.</p><p><strong>Key Themes and Ideas:</strong></p><p><strong>Model Storming: A Collaborative Workshop Approach</strong></p><ul><li><p><strong>Definition:</strong> Interactive workshops where data professionals and business stakeholders work together to solve problems, elicit requirements, and foster a shared understanding of the business.</p></li><li><p><strong>Remco:</strong> <em>"To invite people into a workshop from the business, from the organization without using any technical terms."</em></p></li><li><p><strong>Shane:</strong> <em>"This idea of running interactive, collaborative workshops with our stakeholders, where we work together to use some pattern templates to solve a problem, to either problem solve together or to elicit some requirements or some shared understanding..."</em></p></li><li><p>This differs from traditional data modelling which can be off-putting to business stakeholders: <em>"Dear business, would you please help me? And then you make an appointment and you're the only one in the room. Because they don't like the term data modeling."</em></p></li></ul><p></p><p><strong>The Importance of the Right People:</strong></p><ul><li><p>Getting the <em>right</em> people (business people, those who actually <em>do</em> the work) in the room is essential.</p></li><li><p>No proxies or IT analysts trying to represent the business.</p></li><li><p>Keep data and laptop distractions out of the room.</p></li><li><p><strong>Remco:</strong> <em>"Basically the people running the day-to-day business know what they're doing, know what's going on... But they're not the ones participating."</em></p></li><li><p><em>"I don't want to have data or laptops in the workshops because then they are going to, uh, be IT focused instead of business focused."</em></p></li></ul><p></p><p><strong>Workshop Facilitation Best Practices:</strong></p><ul><li><p>Keep workshops to a focused duration (around 2 hours). Don't be afraid to be done early if there's nothing else to capture.</p></li><li><p>Use the business's <em>own</em> language. If they talk about "clients," don't translate it to "persons."</p></li><li><p>Don't chase people away with technical jargon (e.g., "crow's feet," conceptual/logical/physical models).</p></li><li><p>Provide refreshments (donuts, waffles, lunch) to encourage participation and engagement.</p></li><li><p>Listen to what people are saying, and <em>write it down</em>. 
Don't correct or change what people are saying (e.g., Customer vs Client); this should come up later when discussing what the business actually means by those terms.</p></li><li><p>Make sure the <em>terminology</em> isn't getting in the way - Align, Refine and Design are better terms than Conceptual, Logical and Physical data modelling.</p></li><li><p>Use a shared language when drawing relationships between Core Business Concepts - such as <em>"Customer bought product"</em> rather than cardinality, or many-to-many relationships.</p></li><li><p><em>"Use their terminology in, in every single bit and every single sense."</em></p></li></ul><p></p><p><strong>Core Business Concepts: The Building Blocks</strong></p><ul><li><p><strong>Definition (Remco):</strong> Terms that the organisation defines as important to know something about, where they can define, identify, describe, and provide context.</p></li><li><p><strong>Definition (Shane):</strong> Things that the organization can identify and wants to manage or count (e.g., Customer, Employee, Supplier, Product, Order, Payment).</p></li><li><p>Examples are Customers, Employees, Products, Suppliers, Sales.</p></li><li><p>Remco notes: <em>"For me, a core business concept is everything where a term where the organization said, for me this is important to know something about. I can define it and I can identify it, and I can describe it and have some context to it."</em></p></li><li><p>It&#8217;s important to identify <em>events</em> and <em>transactions</em> (e.g. payments, sales) as Core Business Concepts.</p></li><li><p><em>"It's something that the organization can identify and they wanna manage it or count it."</em></p></li></ul><p></p><p><strong>The Six Artifacts of ELM (Ensemble Logical Modeling): A Structured Approach</strong></p><ol><li><p><strong>CBC (Core Business Concept) List:</strong> A running list of terms with category swim lanes (Event, Person, Place, Thing, Other) to identify synonyms and missing concepts.</p></li><li><p><strong>CBC Form:</strong> Provides detailed context for each term (Name, Category, Definition, Domains, Main Attributes, Relationships, Synonyms, Hierarchy). Includes identifying the context.</p></li><li><p><strong>CBC Canvas:</strong> A visual tool using post-it notes in swim lanes to group terms, dedupe them, and identify potential synonyms/related concepts (e.g., Customer, Client, Klant could all be variations of the same thing).</p></li><li><p><strong>Event Canvas:</strong> Focuses on a <em>specific</em> event and identifies <em>all</em> the CBCs directly involved (e.g., for a "Sale" event: Customer, Employee, Store, Product).</p></li><li><p><strong>MBR (Minimum Business Requirements) Form:</strong> Uses "data by example" to capture real-world instances of the event and its related concepts. This helps to understand granularity, cardinality, and any edge cases (e.g., can a customer buy multiple products in one sale?). This will need to come through multiple iterations.</p></li><li><p><strong>MBR Matrix:</strong> Summarises the information from the previous steps into a "BI Matrix" (similar to Kimball's approach) for reporting purposes. Shows measures (what you want to measure) and angles (how you want to look at it). 
This is important because you can see any terms you haven't used, or haven't related to the events.</p></li></ol><ul><li><p><strong>Note:</strong> These artifacts are open-source and available at ELMstandards.com (registration required).</p></li><li><p><strong>Remco:</strong> <em>"We have six artifacts and basically the first three are focused around the core base concept, and the first artifact is really the starting point is basically writing down all the terms just on the board."</em></p></li><li><p>These are iterative and there isn't a defined strict approach - you may jump from artefact to artefact as conversations evolve.</p></li></ul><p></p><p><strong>Data by Example: Unlocking Insights</strong></p><ul><li><p>Asking stakeholders for real-world examples of data records can reveal patterns, relationships, and edge cases that might otherwise be missed.</p></li><li><p><strong>Shane:</strong> <em>"You start getting examples of data to fit that event and C, b, C model, because we're looking for a understanding of, of some terms, we're looking for confirmation, we've got it right."</em></p></li></ul><p></p><p><strong>The Role of LLMs (Large Language Models) in Data Modeling:</strong></p><ul><li><p>LLMs can <em>assist</em> in the data modelling process by:</p></li><li><p>Generating initial data models from business case descriptions.</p></li><li><p>Suggesting definitions for core business concepts.</p></li><li><p>Identifying potential relationships between entities.</p></li><li><p>Reviewing and providing feedback on existing models.</p></li><li><p>Using Firefly, or other automatic note taking tools, and GPT to take the essence from the workshops.</p></li><li><p><strong>However, LLMs should </strong><em><strong>not</strong></em><strong> be used to fully </strong><em><strong>automate</strong></em><strong> data modelling.</strong>Data modelling is a complex process that requires human understanding of business processes and data nuances.</p></li><li><p>The goal is guidance, not automatic data model creation.</p></li><li><p><strong>Shane:</strong> <em>"For me, this sits in the ask or assisted AI and automated AI."</em></p></li><li><p>Data modeling, when done in context of modelling a business and processes is a lot harder.</p></li><li><p><strong>Remco:</strong> <em>"I gave a business model to, to cloud, to chat, to co-pilot and just say, create a data model without any extra information. Here's a business case. Create a data model. It's coming back perfectly fine with, with a PR diagram. "That's scary because let's be very honest, data modeling is hard."</em></p></li></ul><p></p><p><strong>Temporal Data: Being a Master of Temporal Data</strong></p><ul><li><p><em>&#8220;And because we are data people and we are the masters of temporal data, there's a high chance that this podcast would just happen to go out when the book hits Amazon.&#8221;</em></p></li></ul><p></p><p><strong>Scoping for Agile Delivery</strong></p><ul><li><p>Use the matrices to determine iterative development based on what business concepts are most valuable first.</p></li><li><p><em>"we can deliver that bit first and it has value and then we can extend it out."</em></p></li></ul><p><strong>Conclusion:</strong></p><p>This podcast provides a practical and insightful look at agile data modelling using the ELM approach. 
By focusing on collaboration, business-driven terminology, and a structured yet flexible workshop process, data professionals can effectively elicit requirements, build shared understanding, and create data models that truly reflect the needs of the business. While LLMs offer exciting potential for <em>assisting</em> in the process, they should not replace the human element of understanding and interpreting business needs.</p><p><strong>Further Reading and Resources:</strong></p><ul><li><p>ELM Standards: <a href="http://elmstandards.com/">elmstandards.com</a></p></li><li><p>Remco's Book: "From Storage to Solution" (Available on Amazon and Technics Publications)</p></li><li><p>Contact Remco via LinkedIn.</p><p></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Data Operating Models patterns with Dylan Anderson]]></title><description><![CDATA[AgileData Podcast #60]]></description><link>https://agiledata.info/p/data-operating-models-patterns-with</link><guid isPermaLink="false">https://agiledata.info/p/data-operating-models-patterns-with</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Thu, 20 Feb 2025 23:28:04 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/78ea1814-0708-490a-a461-02ef2c6dfe04_800x800.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Dylan Anderson about the patterns required to define a Data Operating Model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v0Yr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3f3fff7-79d0-444d-89c8-018c34cc0698_498x505.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v0Yr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3f3fff7-79d0-444d-89c8-018c34cc0698_498x505.webp 424w, https://substackcdn.com/image/fetch/$s_!v0Yr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3f3fff7-79d0-444d-89c8-018c34cc0698_498x505.webp 848w, https://substackcdn.com/image/fetch/$s_!v0Yr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3f3fff7-79d0-444d-89c8-018c34cc0698_498x505.webp 1272w, https://substackcdn.com/image/fetch/$s_!v0Yr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3f3fff7-79d0-444d-89c8-018c34cc0698_498x505.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v0Yr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3f3fff7-79d0-444d-89c8-018c34cc0698_498x505.webp" width="498" height="505" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3f3fff7-79d0-444d-89c8-018c34cc0698_498x505.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:505,&quot;width&quot;:498,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:27450,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://agiledata.substack.com/i/157585012?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3f3fff7-79d0-444d-89c8-018c34cc0698_498x505.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!v0Yr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3f3fff7-79d0-444d-89c8-018c34cc0698_498x505.webp 424w, https://substackcdn.com/image/fetch/$s_!v0Yr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3f3fff7-79d0-444d-89c8-018c34cc0698_498x505.webp 848w, https://substackcdn.com/image/fetch/$s_!v0Yr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3f3fff7-79d0-444d-89c8-018c34cc0698_498x505.webp 1272w, https://substackcdn.com/image/fetch/$s_!v0Yr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3f3fff7-79d0-444d-89c8-018c34cc0698_498x505.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><blockquote><h2>Data Ecosystem Patterns</h2></blockquote><p><a href="https://thedataecosystem.substack.com/p/issue-13-defining-the-data-operating">https://thedataecosystem.substack.com/p/issue-13-defining-the-data-operating</a></p><p><a 
href="https://thedataecosystem.substack.com/">https://thedataecosystem.substack.com/</a></p><p></p><blockquote><h2>Listen</h2></blockquote><p>Listen on all good podcast hosts or over at:</p><p><a href="https://agiledata.io/podcast/agiledata-podcast/data-operating-models-patterns-with-dylan-anderson/">https://agiledata.io/podcast/agiledata-podcast/data-operating-models-patterns-with-dylan-anderson/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledata.io/podcast/agiledata-podcast/data-operating-models-patterns-with-dylan-anderson/&quot;,&quot;text&quot;:&quot;Listen to AgileData Podcast Episode&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://agiledata.io/podcast/agiledata-podcast/data-operating-models-patterns-with-dylan-anderson/"><span>Listen to AgileData Podcast Episode</span></a></p><div id="youtube2-DGe1q8dBvgg" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;DGe1q8dBvgg&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/DGe1q8dBvgg?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p></p><blockquote><h2>Read</h2></blockquote><p>Read the podcast transcript at:<br><br><a href="https://agiledata.io/podcast/agiledata-podcast/data-operating-models-patterns-with-dylan-anderson/#read">https://agiledata.io/podcast/agiledata-podcast/data-operating-models-patterns-with-dylan-anderson/#read</a></p><p></p><blockquote><h2>Google NoteBookLLM Briefing</h2></blockquote><p><strong>Briefing Document: Agile Data Podcast - Operating Models for Data Teams</strong></p><p><strong>Source:</strong> Excerpts from "Agile Data Podcast" featuring Dylan Anderson of Perfusion Consulting.</p><p><strong>Subject:</strong> Operating Models for Data Teams</p><p><strong>Executive Summary:</strong></p><p>This podcast explores the crucial role of operating models in enabling data teams to effectively support business strategy. Dylan Anderson, a Data Strategy Consultant, breaks down an operating model into three core components: Oversight &amp; Direction, Structure of Delivery, and What People Do. The conversation highlights the common pitfalls of data strategies &#8211; being too business-focused without technical grounding, or vice-versa &#8211; and emphasizes the importance of aligning data strategy with overarching business objectives, fostering collaboration, and promoting continuous improvement within data teams.</p><p><strong>Key Themes and Ideas:</strong></p><p><strong>Data Strategy Must Support Business Strategy:</strong></p><ul><li><p>The fundamental premise is that data strategy should be derived from and directly support the organization's overall business strategy.</p></li><li><p>Shane poses the question, <em>"First thing I asked for is the organizational strategy, because my view is the data strategy is to support. 
The business strategy to be successful using data."</em> This highlights the dependence of data strategy on a clear understanding of the business's goals.</p></li><li><p>Anderson agrees, stating, <em>"The data strategy is really there to support the business strategy and enable that."</em></p></li><li><p>A key indicator of alignment is the team's understanding of how the organisation makes money. Shane suggests a simple, yet insightful question to assess this understanding.</p></li></ul><p><strong>Bridging the Gap Between Business and Technical Aspects:</strong></p><ul><li><p>A common problem is the disconnect between business-focused data strategies (fancy PowerPoints lacking tangible implementation) and technically-driven strategies (focusing on data and technology without clear business objectives).</p></li><li><p>Anderson notes, <em>"One way is a pure business focused data strategy...but what you don&#8217;t get is the technical architecture...On the other side...you get the technical data strategy...but without talking to the business."</em></p></li><li><p>A successful data strategy requires both a strong understanding of business needs and the technical expertise to execute.</p></li></ul><p><strong>The Three Components of an Operating Model:</strong></p><ul><li><p>Anderson outlines a framework for understanding operating models based on three interconnected circles:</p></li></ul><ol><li><p><strong>Oversight &amp; Direction:</strong> Leadership sets the strategy, ensures alignment with business goals, and establishes performance management.</p></li><li><p><strong>Structure of Delivery:</strong> The tangible processes, team structures, and reporting lines that enable the achievement of strategic goals.</p></li><li><p><strong>What People Do:</strong> The day-to-day job descriptions and responsibilities that execute the strategy.</p></li></ol><p><strong>Importance of Change Management and Stakeholder Buy-In:</strong></p><ul><li><p>Implementing a successful data strategy requires more than just technical expertise; it demands effective change management and stakeholder buy-in.</p></li><li><p>Anderson emphasizes, <em>"If you don&#8217;t think about this as a journey, and if you don&#8217;t involve the people who are going to be impacted by it, then it&#8217;s not going to work."</em></p></li></ul><p><strong>Data Teams as Horizontal Enablers, Not Siloed Verticals:</strong></p><ul><li><p>Many organisations incorrectly structure data teams as separate verticals, when they should be functioning as horizontal teams that support various internal business functions. 
This results in siloed work and limited collaboration.</p></li><li><p>Anderson: <em>"A lot of organizations, their data teams are to support internal businesses...Yet most companies set the data team up as another vertical...whereas it should actually be a horizontal."</em></p></li></ul><p><strong>Key Patterns within Oversight &amp; Direction:</strong></p><ul><li><p><strong>Performance Management:</strong> Measuring progress, which often falls through the cracks, is essential to demonstrate the value of data initiatives.</p></li><li><p><strong>Data Strategy and Team Goals:</strong> Clear alignment and direction are vital, ensuring everyone understands their role in achieving the overall strategy.</p></li><li><p><strong>Governance Forums:</strong> Bringing people together to ensure alignment (and police policies), but these must be led effectively and have clear objectives to avoid being unproductive.</p></li><li><p><strong>Operating Model Principles:</strong> Establishing a culture that fosters collaboration and effectiveness.</p></li><li><p><strong>Program and Org Leadership:</strong> Requires both data leadership (accountability) and organisational leadership (direction) to be successful. Find and utilise internal "change agents" to help unblock progress.</p></li></ul><p><strong>Importance of Delivery Structure:</strong></p><ul><li><p><strong>Reporting Lines:</strong> Determining who directs work versus who manages people's well-being.</p></li><li><p><strong>Team Design:</strong> Utilising patterns like Team Topologies and Unfix to create effective team structures. Addressing the challenge of balancing fast-turnaround requests with large, long-term projects.</p></li><li><p><strong>Workflow and Delivery Processes:</strong> Mapping out the entire process and ensuring effective communication at every stage. This includes stakeholders at scoping, dev and testing stages.</p></li><li><p><strong>Communication is Key:</strong> Don't assume "no news is good news". Update stakeholders regularly with progress (and show incremental value even if the overall product isn't yet ready) to avoid the perception that the data team is a "black box".</p></li></ul><p><strong>Playbooks as a Communication Tool:</strong></p><ul><li><p>A "playbook" is recommended, a visual document that describes how the data team works to new team members and stakeholders.</p></li><li><p>It articulates the team's delivery models, engagement gates, and stakeholder involvement points.</p></li><li><p>Includes the team's value stream, team design, ceremonies and definitions of ready and done.</p></li></ul><p><strong>Evolving Roles and Responsibilities ("What People Do"):</strong></p><ul><li><p>Understanding individual career goals and tailoring roles to incorporate growth and development.</p></li><li><p>Moving beyond strict job descriptions to foster flexibility and engagement.</p></li><li><p>Understanding the different "personas" within a data team, their skills and specialities, and mapping gaps to develop training and hiring strategies.</p></li></ul><p><strong>Quotes:</strong></p><ul><li><p><em>"Everybody seems to focus on technology. They don&#8217;t focus on the people or the process or those other really important things that sit around a team and their technology to deliver."</em> - Shane Gibson</p></li><li><p><em>"The data strategy is really there to support the business strategy and enable that. 
And how does data factor into what you do as a business and make what you do as a business better?"</em> - Dylan Anderson</p></li><li><p><em>"If you don&#8217;t think about this as a journey, and if you don&#8217;t involve the people who are going to be impacted by it, then it&#8217;s not going to work."</em> - Dylan Anderson</p></li><li><p><em>"An operating model is basically a ways of working that helps you understand how to collaborate across the organization and coordinate to deliver the initiatives and the tasks in front of you."</em> - Dylan Anderson</p></li></ul><p><strong>Actionable Insights:</strong></p><ul><li><p>Ensure your data strategy is firmly rooted in your organisation's business strategy.</p></li><li><p>Prioritise clear communication and stakeholder engagement throughout the entire data project lifecycle.</p></li><li><p>Don't treat your operating model as a static document; encourage continuous iteration and improvement.</p></li><li><p>Focus on building a collaborative culture that empowers data teams to contribute effectively.</p></li><li><p>Structure your data teams as horizontal enablers that support various business functions.</p></li><li><p>Implement clear reporting lines and establish defined workflows and delivery processes.</p><p></p></li></ul>]]></content:encoded></item><item><title><![CDATA[DataOps Patterns with Chris Bergh]]></title><description><![CDATA[AgileData Podcast #59]]></description><link>https://agiledata.info/p/dataops-patterns-with-chris-bergh</link><guid isPermaLink="false">https://agiledata.info/p/dataops-patterns-with-chris-bergh</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Wed, 12 Feb 2025 20:15:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c6f110ae-38a1-41bf-b4bc-a7943542cb1b_450x450.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Chris Bergh on improving your teams way of working by using DataOps patterns.</p><blockquote><h2>A list of great DataOps resources </h2></blockquote><p><strong><a href="https://datakitchen.io">https://datakitchen.io</a></strong></p><p><strong><a href="https://datakitchen.io/the-dataops-cookbook/">The Dataops Cookbook</a></strong></p><p><strong><a href="https://datajourneymanifesto.org/">The Data Journey Manifesto</a></strong></p><p><strong><a href="https://datakitchen.io/dataops-observability-product/">Open Source Data Observability</a></strong></p><p><strong><a href="https://datakitchen.io/dataops-testgen-product/">Open Source DataOps Data Quality TestGen</a></strong></p><blockquote><h2>Listen</h2></blockquote><p>Listen on all good podcast hosts or over at:</p><p><a href="https://podcast.agiledata.io/e/dataops-patterns-with-chris-bergh-episode-59/">https://podcast.agiledata.io/e/dataops-patterns-with-chris-bergh-episode-59/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://podcast.agiledata.io/e/dataops-patterns-with-chris-bergh-episode-59/&quot;,&quot;text&quot;:&quot;Listen to AgileData Podcast Epsiode&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://podcast.agiledata.io/e/dataops-patterns-with-chris-bergh-episode-59/"><span>Listen to AgileData Podcast Epsiode</span></a></p><div id="youtube2-rU2mHOwZbS8" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;rU2mHOwZbS8&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div 
class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/rU2mHOwZbS8?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><blockquote><h2>Read</h2></blockquote><p>Read the podcast transcript at:<br><br><a href="https://agiledata.io/podcast/agiledata-podcast/dataops-patterns-with-chris-bergh/#read">https://agiledata.io/podcast/agiledata-podcast/dataops-patterns-with-chris-bergh/#read</a></p><p></p><blockquote><h2>Google NoteBookLLM Briefing</h2></blockquote><p><strong>Briefing Document: DataOps Patterns and Principles with Chris Bergh</strong></p><p><strong>Introduction</strong></p><p>This document summarises the key themes and ideas discussed in the AgileData Podcast interview with Chris Bergh, co-founder of Data Kitchen. Bergh, a techie with a background at NASA and MIT, brings a unique perspective to DataOps, drawing from his experiences with software development, lean manufacturing, and a strong focus on customer success. The conversation unpacks what DataOps really means and how it applies to data and analytics teams, while also touching on the common problems these teams face and some possible solutions.</p><p><strong>Key Themes</strong></p><ul><li><p><strong>DataOps is More Than Just DevOps for Data:</strong> While DevOps practices like automation and CI/CD are important, they are "necessary but not sufficient" for DataOps. Bergh emphasises that DataOps also requires incorporating principles from lean manufacturing (like statistical process control and data quality testing) and a strong focus on the customer. "It's about productivity. It's about making your customers more successful instead of, let's do more automation, which is a means to get there."</p></li><li><p><strong>The Data Factory Analogy:</strong> Bergh and host Shane Gibson discuss the analogy of a data team being like a factory, but acknowledge it's more complex than that. It's about the "left to right process of integrating data and producing insight," but also about the "perpendicular" value stream of rapidly iterating on and changing data products. This dual nature means data teams have to be both "manufacturing teams and software teams."</p></li><li><p><strong>Customer-Centricity:</strong> A central idea is the need for data teams to move away from a purely technical focus to become more customer-centric, "helping teams get less focused on technology and more focused on their customer." Teams should be focused on delivering value to their customers, not just producing data.</p></li><li><p><strong>Waste Reduction:</strong> The importance of reducing waste is a recurring theme. Waste isn't just about over-engineering or over-collecting data, but also about building things that aren't used and processes that aren't efficient. &#8220;It&#8217;s really about maximizing what you don&#8217;t have to do is the real key here,&#8221; says Bergh.</p></li><li><p><strong>Team Empowerment and Ownership:</strong> A big chunk of the discussion is about empowering data teams to make decisions and take ownership of their processes. This includes the ability to stop the line, fix problems, and not be afraid to surface issues. As Bergh puts it, &#8220;As a leader, you own the result. Not the person who cut it up, not the supplier. It&#8217;s like you own it. And so you have to fix the process." He believes that "95 percent of the time... 
when you have problems, it&#8217;s mainly the process people work in and not the person."</em></p></li><li><p><strong>The Need for Metrics (DataOps DORA Metrics):</strong> The lack of clear metrics for data teams was identified as a significant problem. Bergh highlights that while software teams use DORA metrics (Deployment Frequency, Lead Time, Change Failure Rate, Mean Time to Recovery), data teams don&#8217;t have an equivalent. They should be measuring things like error rates, cycle time, utilization/value of data products, and time spent actually creating value. &#8220;As data and analytic teams, we&#8217;re so unanalytic about how we run our organizations.&#8221;</p></li><li><p><strong>The Importance of Feedback Loops:</strong> The conversation repeatedly comes back to the importance of getting feedback &#8211; from customers, from the data itself, and from the team&#8217;s processes. These feedback loops are essential for reducing waste, improving quality, and delivering value.</p></li><li><p><strong>Data Quality is a Linchpin:</strong> Chris believes that data quality issues are often a critical bottleneck for data teams. Poor data quality often leads to blame and lack of progress. Bergh is pushing for data quality teams to target specific business-critical data elements rather than trying to fix everything at once. They should also be able to take actionable steps to provide specific data for the teams who are ultimately in charge of fixing it.</p></li></ul><p><strong>Important Ideas and Facts</strong></p><ul><li><p><strong>DataOps is not new:</strong> While some might see DataOps as a recent buzzword, the underlying principles are derived from more established domains.</p></li><li><p><strong>Teams can make incremental changes:</strong> Even in "blame" cultures, there are ways to make small improvements, such as quality circles, refactoring, and automating repeatable tasks.</p></li><li><p><strong>Version Control is not enough:</strong> Teams need to have effective environments and test data in order to properly implement version control.</p></li><li><p><strong>Teams are often overly focused on technology, not value:</strong> This leads to the development of data products that are unused or not valuable.</p></li><li><p><strong>The "Excel user" problem:</strong> The data team often delivers complex data that is immediately deconstructed by business users in Excel &#8211; highlighting a mismatch between what's produced and what's needed. The analogy of a three-kitchen structure where the final meal is often deconstructed is used.</p></li><li><p><strong>Leaders should focus on process:</strong> Leaders need to take ownership of their team's processes and create environments that enable them to succeed, rather than blaming people for errors.</p></li><li><p><strong>There needs to be a balance in applying practices</strong>: There's not a single "right way" to implement these principles. 
For example, deciding how many attributes to bring in with data collection has a "balance and context that matters and having the discussion".</p></li><li><p><strong>"Whiney little bitches"</strong>: Managers who are blaming others should be called out.</p></li><li><p><strong>Data modeling is a lost art</strong>: There are experts out there who can model data quickly and accurately, but it's important to enable the rest of the team.</p></li><li><p><strong>Metrics are key for DataOps</strong>: If you aren't measuring the performance of your data teams, it will be difficult to drive change.</p></li><li><p><strong>Instrument your data to understand value:</strong> Data teams should track who is using what data and how often to gain visibility into what their customers value.</p></li><li><p><strong>Data contracts</strong>: Data contracts can be a way of getting teams to focus on the end-to-end cycle and take more accountability.</p></li><li><p><strong>DataOps = Reduction of Waste:</strong> It can be viewed as a method to reduce waste in data processes.</p></li><li><p><strong>Quality Circles</strong>: This is a method where you look at every error and try to fix them, which can be done by putting all errors into a spreadsheet.</p></li><li><p><strong>Data Quality is the key:</strong> Data quality can be the key to moving forward with Agile data teams.</p></li><li><p><strong>There are too many "Field of Dreams" Data Teams:</strong> Building it doesn't mean they will come, value needs to be the focus.</p></li><li><p><strong>Open Source Tools and Resources:</strong> Data Kitchen offers open-source tools, training programs, and blog content that can help teams implement DataOps principles.</p></li></ul><p><strong>Quotes for Emphasis</strong></p><ul><li><p>"It&#8217;s a set of technical practices and management paradigms for data and analytic teams to drive customer success and be more productive." &#8211; Chris Bergh defining DataOps.</p></li><li><p>"I think the patterns here are not new... I just think we're, if anything, we're just trying to take these principles and say, look, they have a unique instantiation and data and analytics, but the ideas are old." - Chris Bergh on the origins of DataOps principles.</p></li><li><p>&#8220;I&#8217;m really sick of... the whiny little bitches who blame other people. I&#8217;m really tired of it, especially the people who lead teams.&#8221; &#8211; Chris Bergh on leadership accountability.</p></li><li><p>&#8220;The most important metadata of any organization is code.&#8221; &#8211; Chris Bergh on active metadata.</p></li><li><p>"If your customers trust your data, that means you have very low errors." - Chris Bergh</p></li></ul><p><strong>Conclusion</strong></p><p>This discussion provides a valuable framework for understanding and implementing DataOps. Chris Bergh&#8217;s insights, drawn from a long and varied career, highlight the importance of customer focus, continuous improvement, and team empowerment. The conversation also serves as a call to action for data teams to step up and take ownership of their processes and ultimately, the value they provide to the wider organisation. 
By focusing on waste reduction and implementing solid metrics, data teams can move from being seen as cost centres to strategic assets within an organisation.</p>]]></content:encoded></item><item><title><![CDATA[Merging Data Vault and Medallion Architecture Patterns with Patrick Cuba]]></title><description><![CDATA[AgileData Podcast #58]]></description><link>https://agiledata.info/p/merging-data-vault-and-medallion</link><guid isPermaLink="false">https://agiledata.info/p/merging-data-vault-and-medallion</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Fri, 31 Jan 2025 02:39:02 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7ce6a789-bca6-4d40-b66c-6b95d9152dbb_288x288.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Patrick Cuba on combining the Data Vault data modeling pattern with the Medallion Architecture pattern.</p><p>This discussion was based on the patterns described in this article:<br><br><a href="https://medium.com/the-modern-scientist/the-modern-data-vault-stack-75103102e3d2">https://medium.com/the-modern-scientist/the-modern-data-vault-stack-75103102e3d2</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dpPt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25114411-0da9-4075-905c-1e7bed52057f_1100x720.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dpPt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25114411-0da9-4075-905c-1e7bed52057f_1100x720.webp 424w, https://substackcdn.com/image/fetch/$s_!dpPt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25114411-0da9-4075-905c-1e7bed52057f_1100x720.webp 848w, https://substackcdn.com/image/fetch/$s_!dpPt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25114411-0da9-4075-905c-1e7bed52057f_1100x720.webp 1272w, https://substackcdn.com/image/fetch/$s_!dpPt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25114411-0da9-4075-905c-1e7bed52057f_1100x720.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dpPt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25114411-0da9-4075-905c-1e7bed52057f_1100x720.webp" width="1100" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25114411-0da9-4075-905c-1e7bed52057f_1100x720.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:133740,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!dpPt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25114411-0da9-4075-905c-1e7bed52057f_1100x720.webp 424w, https://substackcdn.com/image/fetch/$s_!dpPt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25114411-0da9-4075-905c-1e7bed52057f_1100x720.webp 848w, https://substackcdn.com/image/fetch/$s_!dpPt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25114411-0da9-4075-905c-1e7bed52057f_1100x720.webp 1272w, https://substackcdn.com/image/fetch/$s_!dpPt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25114411-0da9-4075-905c-1e7bed52057f_1100x720.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><blockquote><h2>Listen</h2></blockquote><p>Listen on all good podcast hosts or over at:</p><p><a href="https://podcast.agiledata.io/e/merging-data-vault-and-medallion-architecture-patterns-with-patrick-cuba/">https://podcast.agiledata.io/e/merging-data-vault-and-medallion-architecture-patterns-with-patrick-cuba/</a></p><div id="youtube2-tpIZ4o_Vxh8" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;tpIZ4o_Vxh8&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/tpIZ4o_Vxh8?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p></p><blockquote><h2>Read</h2></blockquote><p>Read the podcast transcript at:<br><br><a 
href="https://agiledata.io/podcast/agiledata-podcast/merging-data-vault-and-medallion-architecture-patterns-with-patrick-cuba/#read">https://agiledata.io/podcast/agiledata-podcast/merging-data-vault-and-medallion-architecture-patterns-with-patrick-cuba/#read</a></p><p></p><blockquote><h2>Google NoteBookLLM Briefing</h2></blockquote><p><strong>Briefing Document: Merging Data Vault and Medallion Architecture</strong></p><p><strong>1. Introduction</strong></p><p>This document summarises the key points from a podcast discussion between Patrick Cuba and Shane Gibson, exploring the intersection of Data Vault modelling and Medallion Architecture patterns. The conversation delves into their practical experiences, challenges, and evolving perspectives on data modelling and architecture in the modern data landscape. They unpack the two main ideas and how they fit together.</p><p><strong>2. Background and Experience</strong></p><ul><li><p><strong>Patrick Cuba:</strong> Has a background as a "hardcore SAS architect," encountering Data Vault at a customer site in Brisbane. He recognized the potential for automation and developed a tool to generate Data Vault models. He's worked with various Data Vault implementations, including NoSQL platforms and tools like Warescape, and now works at Snowflake, focusing on customer onboarding and the practical application of Data Vault.</p></li><li><p><em>"my background is I was a hardcore SAS architect... And I actually came across Data Vault at a customer site."</em></p></li><li><p><em>"I was not intending to do any data vault work because it was about, getting customers on boarded onto Snowflake."</em></p></li><li><p><strong>Shane Gibson:</strong> Also has experience in SAS, and finds it intriguing that Brisbane is an origin point for Data Vault adoption in Australasia. He's particularly interested in Data Vault's modelling patterns, as well as the way you build things around it. He advocates for a flexible approach, combining Data Vault with other modelling techniques like One Big Table and Activity Schema. He has a product using "landing history, design and consume" as it's layering concepts.</p></li></ul><p><strong>3. Data Vault Defined</strong></p><ul><li><p><strong>Enterprise Vision:</strong> Patrick highlights Dan Linstedt's definition of Data Vault as "a system of business intelligence containing the necessary components needed to accomplish enterprise vision in data warehousing and information delivery." He emphasizes the "enterprise vision" as a key aspect, asking, <em>"what are we actually doing with this data model?"</em></p></li><li><p><strong>Core Components:</strong> The core of a data model, according to Patrick, must address:</p></li><li><p>Business entities</p></li><li><p>Relationships between entities (interactions, transactions, etc.)</p></li><li><p>State information of entities. These are then mapped to hubs, links and satellites in the vault.</p></li><li><p><strong>Platform Advantage:</strong> Patrick believes Data Vault is advantageous because "nothing in the industry takes advantage of the OLAP platform&#8217;s capabilities quite like Data Vault does," as it embodies those three things.</p></li><li><p><strong>Integration:</strong> Data Vault excels at integrating data by business keys, allowing non-destructive ingestion as data sources change. 
The data vault "is designed to ingest those non destructively."</p></li><li><p><strong>Flexibility:</strong> Patrick acknowledges that Data Vault isn't always the best fit, stating, <em>"looking at your use case, I think the complexity is not worth it. You should stick to Kimball modeling."</em> He notes the need for a nuanced approach, depending on complexity.</p></li></ul><p><strong>4. Medallion Architecture Defined</strong></p><ul><li><p><strong>Layered Approach:</strong> Patrick views Medallion Architecture as a "layered architecture" that's been around for a while but has been well-marketed. It involves landing, transforming, and presenting data, and he notes that many businesses have similar processes but use different terms. The layers, as he and others have used, have names such as "curated zone, coherent zone and intelligent zone", or "cell, EDW, and consumption layer".</p></li><li><p><strong>Layered Principle:</strong> Shane sees it as a way to "set a bunch of principles. Policies or patterns to say, this is going to happen in this layer". It&#8217;s all about putting data into well-defined layers based on how it's used, rather than just a giant bucket. It's about code layers, not physical layers.</p></li><li><p><strong>Consistent Purpose:</strong> The goal of layering is to avoid a "big bucket of crap that we can&#8217;t deal with." The various terms over time include "persisting staging area, EDW, retail presentation". It is important to be clear what the purpose of a given layer is.</p></li><li><p><strong>Bronze, Silver, Gold Mapping:</strong> They discuss the mapping of layers:</p></li><li><p><strong>Bronze:</strong> Is like an ODS (Operational Data Store), or source-aligned data (or a "landing history")</p></li><li><p><strong>Silver:</strong> Represents the EDW (Enterprise Data Warehouse) layer, where data is designed, conformed and integrated. This is where Data Vault lives. (Or the "design" layer).</p></li><li><p><strong>Gold:</strong> A consumer or presentation layer, tailored for specific uses. (Or the "consume" layer).</p></li><li><p><strong>Beyond Names:</strong> Different organisations will give the layers different names. Patrick has seen "curated, coherent and intelligent zone" or "cell, source-aligned and integration layer", "design" and "consumption" layer.</p></li><li><p><strong>Iterative:</strong> Patrick notes his architectures are iterative, evolving with customer needs and clever ideas they have. He always works with clever people who know what they want. He often uses whiteboarding with customers to get everyone in the room to discuss what to use. His libraries are labelled "lib_" and this comes from working with a scientist who thought it useful to label them all as libraries, due to his SAS experience.</p></li></ul><p><strong>5. Merging Data Vault and Medallion Architecture</strong></p><ul><li><p><strong>Data Vault in Silver:</strong> Patrick positions the Data Vault modelling technique (hubs, links, satellites) within the silver (or coherent) layer. He sees the core modeling technique for Vault as fitting here.</p></li><li><p><strong>Persistent Staging vs Raw Vault:</strong> Patrick emphasizes the distinction between a persistent staging area (bronze) and the raw vault in silver.</p></li><li><p><strong>Bronze:</strong> mirrors the source system as much as possible. 
The physical tables mirror the source.</p></li><li><p><strong>Raw Vault:</strong> is where conformed Hubs, Links and Satellites live.</p></li><li><p><strong>Raw Vault:</strong> Hubs, Links and Satellites in Data Vault 2.0 reflect the source, with changes structured into change tracking. Satellites only track true changes. Links depict business processes from source. Hubs are the pins that hold a data platform together.</p></li><li><p>Not all data in the landing zone is equal, so it makes sense to split it out. It is easy to purge or archive data in the bronze area if required.</p></li><li><p><strong>Business Vault:</strong> Patrick advocates for a "sparsely modeled" business vault. Where you transform data not available in the source, and add intelligence to the Data Vault, "you simply expand it with sat underscore bv underscore whatever with those persisted attributes from that business rule that you&#8217;ve developed."</p></li><li><p><strong>Data Auditing:</strong> The business vault inherits the same auditability as the rest of the raw vault. He thinks the identifiers should be split into its own satellite table for easier GDPR management. Instead of deleting, you should scrub the data to protect it, but keep the hash diff, so you know when it turns up again.</p></li><li><p><strong>Complexity Location:</strong> They discuss where to solve complex issues (mapping keys, etc.). Patrick recommends solving them earlier in the process (source, pre-staging, business vault) rather than leaving it to users (the "wooden spoon" option). The earlier you solve it, the cheaper it will be, and the simpler your business terms will become.</p></li><li><p><strong>Source Specific Vaults:</strong> Where you have customer keys in different systems, the decision must be made on where to do the key conformity. Do we conform in bronze? (No, says Shane). You can choose whether to have source-specific hubs to begin with or conform the keys to a single hub straight away. A conforming hub is good practice.</p></li><li><p><strong>Business Vault (again):</strong> Good practice means that raw holds the data; you then create additional Business Vault objects where you need to infer, conform or do "bad things" to it to make it fit for purpose. Patrick notes that Business Vault should be as small as possible; in a perfect world there would be no need for a business vault. Should you copy raw data to business vault? This should not be done unless there is a good reason to do so. Everything in business vault should be declarative and idempotent. You should also think about "separation of business rules and outcomes" so that you can write business logic in any language you want.</p></li></ul><p><strong>6. Patterns, Principles, and Policies</strong></p><ul><li><p><strong>Patterns:</strong> Reusable solutions to common problems.</p></li><li><p><strong>Principles:</strong> Fundamental beliefs (e.g., "avoid duplication at all costs"), acting as guides that might not always be followed.</p></li><li><p>"Ideally we want to virtualise the business vault as much as possible"</p></li><li><p><strong>Policies:</strong> Enforceable rules (e.g., data stored in BRONZE must match the source system). Breaking a policy will cause you problems.</p></li><li><p>The data stored in BRONZE must match the source system</p></li><li><p><strong>Computational Governance:</strong> Ideally you want code that tells you when a policy has been broken. 
"Write some code which is useful anyway to say check the schema of that table in bronze, check the schema of the table we got given if they don&#8217;t match, somebody&#8217;s inferring, conforming, doing bad shit in there, flag it".</p></li><li><p><strong>Renaming:</strong> Should be done only as few times as possible, but the place to do that is not always clear. You might need it for the tooling or a logical or business view. It's better to use views in the Gold area for this purpose. There is an anti-corruption layer in the Silver layer to add column names to columns (but not physically) in order to conform, but this doesn't mean that the column name is changed in the source.</p></li><li><p><strong>Data Quality:</strong> This is another core engineering practice, applied in multiple different places. There needs to be a collection of data quality patterns that are usable throughout the different layers.</p></li><li><p><strong>Metrics:</strong> Includes facts, measures and metrics. They must be carefully considered where to do it.</p></li><li><p>Facts: A number that comes from a system (quantity)</p></li><li><p>Measures: Things we do to facts (sum, average, count)</p></li><li><p>Metrics: formulas, excel like A divided by B, inferring something based on facts or measures.</p></li><li><p><strong>Metric Vault:</strong> Can be a place to put metrics for consumption later on (business vault extension).</p></li><li><p><strong>Virtual vs Physical:</strong> When using views for business rules, you should consider if you want the data to be historical or not, and this will help you to decide whether to virtualise or persist. Sometimes you need to materialise your view for performance and that is ok. It might be a principle to use virtual business rules as much as possible.</p></li><li><p><strong>Code Reuse:</strong> DataVault helps to write small blocks of reusable code (like YAML).</p></li><li><p><strong>Master Concepts:</strong> The core concepts such as customer, supplier and employee need to be mastered. These can sometimes be weak (invoice status, admin processes) or master concepts. Master concepts should be managed once and reused everywhere.</p></li></ul><p><strong>7. Semantic Language and Analogies</strong></p><ul><li><p><strong>Context is Key:</strong> The language and terminology of data work are important for communication and clarity. You need to think about how you refer to different elements (landing, producer, intelligence, consumer) as these are often swapped around, or have different meanings in different contexts. Data people understand this, but others new to the field may struggle.</p></li><li><p><strong>Standardization:</strong> Standardizing language is important when discussing patterns, along with analogies.</p></li><li><p><strong>Kitchen Analogy:</strong> Shane uses a kitchen analogy, where you have a storeroom, kitchen (where work happens) and a server where data is consumed.</p></li></ul><p><strong>8. Further Considerations</strong></p><ul><li><p><strong>Batch and Streaming:</strong> How these happen within the architecture was not explored too deeply, but this was also identified as a lens to view the topic.</p></li><li><p><strong>Medium Articles:</strong> Patrick publishes primarily on Medium.</p></li><li><p><strong>Technical Book:</strong> Patrick has a book on Amazon on how to technically implement Data Vault. There are lots of books on how to model it but less on implementing it.</p></li></ul><p><strong>9. 
Conclusion</strong></p><p>The discussion highlights the need for a balanced and pragmatic approach to data architecture, combining the principles of Data Vault modeling with the flexibility of a Medallion architecture. Both Patrick and Shane encourage clear articulation of patterns, principles, and policies to guide decision-making. Their ideas are focused on creating robust, scalable data solutions.</p>]]></content:encoded></item><item><title><![CDATA[Agentic Mesh Ecosystem Patterns with Eric Broda]]></title><description><![CDATA[AgileData Podcast #57]]></description><link>https://agiledata.info/p/agentic-mesh-ecosystem-patterns-with</link><guid isPermaLink="false">https://agiledata.info/p/agentic-mesh-ecosystem-patterns-with</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Mon, 27 Jan 2025 20:12:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/adb19650-81ef-4879-9630-830652717d7b_800x800.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Eric Broda on the patterns required to create an ecosystem to support the use of Agents in enterprise organisations.</p><blockquote><h2>Listen</h2></blockquote><p>Listen on all good podcast hosts or over at:</p><p><a href="https://podcast.agiledata.io/e/agiledata-57-agentic-mesh-ecosystem-patterns-with-eric-broda/">https://podcast.agiledata.io/e/agiledata-57-agentic-mesh-ecosystem-patterns-with-eric-broda/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://podcast.agiledata.io/e/agiledata-57-agentic-mesh-ecosystem-patterns-with-eric-broda/&quot;,&quot;text&quot;:&quot;Listen to AgileData Podcast Episode&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://podcast.agiledata.io/e/agiledata-57-agentic-mesh-ecosystem-patterns-with-eric-broda/"><span>Listen to AgileData Podcast Episode</span></a></p><div id="youtube2-xGxJhhBTT1s" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;xGxJhhBTT1s&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/xGxJhhBTT1s?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><blockquote><h2>Read</h2></blockquote><p>Read the podcast transcript at:<br><br><a href="https://agiledata.io/podcast/agiledata-podcast/agentic-mesh-ecosystem-patterns-with-eric-broda/#read">https://agiledata.io/podcast/agiledata-podcast/agentic-mesh-ecosystem-patterns-with-eric-broda/#read</a></p><p></p><blockquote><h2>Google NoteBookLLM Briefing</h2></blockquote><p><strong>Briefing Document: Agentic Mesh Ecosystems</strong></p><p><strong>Source:</strong> "Agentic Mesh Ecosystem Patterns with Eric Broda - AgileData.io" Podcast, January 24, 2025.</p><p><strong>Introduction:</strong></p><p>This podcast interview with Eric Broda explores the concept of the "Agentic Mesh," a novel approach to leveraging AI agents within enterprise organisations. Broda, a tech veteran with a background in API, service mesh, and data mesh implementation, now focuses on building ecosystems for enterprise-grade autonomous agents. 
This briefing summarises the core ideas and key takeaways from the discussion, focusing on definitions, patterns, and implementation considerations.</p><p><strong>Key Definitions:</strong></p><ul><li><p><strong>Generative AI (Gen AI):</strong> Broda defines Gen AI as a "superpower" enabled by Large Language Models (LLMs), allowing for natural language interaction and, potentially, reasoning by computers. He sees it evolving from content creation to a cornerstone of broader ecosystems. <em>"It&#8217;s a superpower that lets computers be heck of a lot smarter than they have been in the past."</em></p></li><li><p><strong>Agent:</strong> An agent utilises an LLM to plan and execute tasks using tools. A more sophisticated agent can learn from past interactions and create new capabilities. Importantly, an agent also interacts with its environment via tools, unlike a standard LLM interface. <em>"An agent is, it uses an LLM, large language model. It can plan its activities, it can then execute those activities or tasks, and it can use tools to actually do that."</em></p></li><li><p><strong>Agentic Mesh:</strong> The Agentic Mesh is defined as an ecosystem where autonomous agents can find each other and collaborate safely, interact, and transact. This ecosystem also needs to be "enterprise-grade". <em>"Agentic mesh is an ecosystem that lets autonomous agents find each other and safely collaborate. Interact and transact."</em> This ecosystem also considers the consumer, producer, agent and operator experiences.</p></li></ul><p><strong>Core Themes and Concepts:</strong></p><ul><li><p><strong>Ecosystem Model (Mesh):</strong> The "mesh" is framed as an ecosystem, similar to platforms like Airbnb or Uber. It enables producers (those who create agents) and consumers (those who use agents) to find each other, interact, and transact. Broda emphasises the simplicity of this concept despite confusion created by some in the tech industry. <em>"So for me, a mesh is just an ecosystem. And if I have a data mesh, it lets folks who want to consume data and folks who want to produce data, find each other, interact, collaborate, and transact."</em></p></li><li><p><strong>The Five Planes:</strong> Broda outlines five key "experience planes" that make up the agentic mesh:</p></li><li><p><strong>Consumer Plane:</strong> The consumer interacts with an agent, via a user-interface, or chat like interface where they are looking to achieve a task. This may initially resemble an "app store for agents", but Broda believes it will evolve towards a more intuitive, chat-based interface, possibly multimodal.</p></li><li><p><strong>Producer Plane:</strong> This plane is for people who build agents. It provides the templates and toolkits needed for planning, execution, and use of tools as well as making agents enterprise grade. This plane also provides monitoring, version and upgrade capabilities.</p></li><li><p><strong>Governance Plane:</strong> This provides policies for agents and for compliance to the organisation's requirements. Here agents and owners are provided the tools to demonstrate the agents are working as expected. Broda also uses the term "certification" instead of governance.</p></li><li><p><strong>Agent Plane:</strong> This plane focuses on how agents find each other, interact, and collaborate. A "super planner" or "super orchestrator" is where a request initially comes into the agent ecosystem. 
It creates a plan based off of the agents available, then gives the tasks to each agent to execute.</p></li><li><p><strong>Operator Plane:</strong> Provides the technology and platform required to operate the agents such as Kubernetes, cloud technologies and managing LLMs at scale.</p></li><li><p><strong>Agentic vs. Human Work:</strong> Broda argues that while humans will remain in the loop (especially for governance and oversight), agents will increasingly replace the people who do a lot of work in the background. He foresees automation of business processes and a potential decrease in human intervention where those processes can be made repeatable. The key element to this is that current human led processes have many unstructured, untidy parts and this can be handled by AI agents. <em>"My proposition is a lot of those people in the loop today, humans in the loop can be represented by agents."</em></p></li><li><p><strong>Microservices Architecture:</strong> Broda emphasizes that agents should be treated as microservices &#8211; small, independent, and containerised entities. This approach facilitates integration with existing enterprise infrastructure, ensuring security, discoverability, and operational efficiency. <em>"Every agent is, and we&#8217;ll talk about this, it&#8217;s a microservice. It&#8217;s in a container, it&#8217;s deployed, it has some endpoints, and it has a way of interacting. It has an LLM, so we have a smart, a very smart microservice."</em></p></li><li><p><strong>Importance of Determinism and Repeatability:</strong> While LLMs are not 100% deterministic, Broda highlights techniques to improve reliability. He uses the term "repeatable" as an approach to build processes so you can expect a predictable outcome each time, despite the unstructured processes an agent can deal with. He suggests that well-defined policies, appropriate context, schema constraints, and prompt engineering can greatly increase the repeatability and reliability of agent behaviour. He argues the goal is not to achieve 100% deterministic behaviour, but to achieve significantly better outcomes than the current human systems.</p></li><li><p><strong>Enterprise Grade Agents:</strong> Broda defines enterprise-grade as agents that fit into a normal enterprise operating environment and meet service level expectations around discoverability, observability, operability, and security. This can be achieved by implementing agents as microservices with OAuth2 and RBAC, logging and alerting capabilities. He stresses that the current tooling for creating AI agents is not "enterprise grade". <em>"It means that they&#8217;re going to, simplistically, they&#8217;re going to fit into a regular, normal enterprise&#8217;s operating environment and meet the regular, normal, service level expectations that they have."</em></p></li></ul><p><strong>Challenges &amp; Future Outlook:</strong></p><ul><li><p><strong>Transitioning from POCs:</strong> Broda notes that most current AI projects are in the "science experiment" or "proof of concept" stage, often with limited real-world value due to the lack of enterprise-grade agent toolkits. He thinks there will be a shift from this to the adoption of enterprise-grade agents and tools that will accelerate development and adoption.</p></li><li><p><strong>Governance and Policies:</strong> The interview highlights the challenge of implementing robust governance and policies for autonomous agents. It's not a solved problem. 
Broda sees a need for federated ownership and robust certification mechanisms to ensure agents operate within defined boundaries.</p></li><li><p><strong>The Agent Plane:</strong> The agent plane presents a challenge due to the complexity of building agents that can make independent decisions and work recursively. It needs optimization and contract patterns for it to work effectively.</p></li><li><p><strong>The Next Gold Rush:</strong> Broda states there has been billions of dollars of investment recently by all the major tech companies into the "agentic future." and now is the time to prepare, to stake a claim in this new area.</p></li></ul><p><strong>Conclusion:</strong></p><p>Eric Broda's perspective on the Agentic Mesh offers a thought-provoking vision of the future of enterprise AI. His emphasis on ecosystems, microservices, and "enterprise-grade" capabilities provides a practical framework for transitioning from experimental AI projects to real-world business value. While challenges remain, particularly in the areas of governance and agentic interactions, the potential benefits of an agent-based architecture are substantial, making this an area worth close attention.</p><p><strong>Key Takeaway:</strong> It's about moving from "Ask AI" or basic chatbots to creating autonomous agents that work in an enterprise setting using a microservices approach, with emphasis on the ecosystems, observability and discoverability.</p>]]></content:encoded></item><item><title><![CDATA[Reliability Engineering of AI Agents with Petr Pascenko]]></title><description><![CDATA[AgileData Podcast #56]]></description><link>https://agiledata.info/p/reliability-engineering-of-ai-agents</link><guid isPermaLink="false">https://agiledata.info/p/reliability-engineering-of-ai-agents</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Wed, 08 Jan 2025 22:06:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c8be8bfe-10cf-44ba-b56b-7180f493aef8_651x651.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join Shane Gibson as he chats with Petr Pascenko on the pattern of Reliability Engineering of AI Agents.</p><blockquote><h2>Listen</h2></blockquote><p>Listen on all good podcast hosts or over at:</p><p><a href="https://podcast.agiledata.io/e/agiledata-56-reliability-engineering-of-ai-agents-with-petr-pascenko/">https://podcast.agiledata.io/e/agiledata-56-reliability-engineering-of-ai-agents-with-petr-pascenko/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://podcast.agiledata.io/e/agiledata-56-reliability-engineering-of-ai-agents-with-petr-pascenko/&quot;,&quot;text&quot;:&quot;Listen to AgileData Podcast Episode&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://podcast.agiledata.io/e/agiledata-56-reliability-engineering-of-ai-agents-with-petr-pascenko/"><span>Listen to AgileData Podcast Episode</span></a></p><div id="youtube2-ClYaTY6uf_Q" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;ClYaTY6uf_Q&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/ClYaTY6uf_Q?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" 
height="409"></iframe></div></div><p></p><blockquote><h2>Read</h2></blockquote><p>Read the podcast transcript at:<br><br><a href="https://agiledata.io/podcast/agiledata-podcast/reliability-engineering-of-ai-agents-with-petr-pascenko/#read">https://agiledata.io/podcast/agiledata-podcast/reliability-engineering-of-ai-agents-with-petr-pascenko/#read</a></p><p></p><blockquote><h2>ChatGPT Summary</h2></blockquote><p>In the AgileData podcast episode titled &#8220;Reliability Engineering of AI Agents,&#8221; Shane Gibson converses with Petr Pascenko. Petr discusses the evolution of AI applications, particularly the development of AI agents designed to perform tasks autonomously.</p><p>One highlighted project involves creating an AI agent for a German bank to manage compliance with the European Union&#8217;s Digital Operational Resilience Act (DORA). This regulation requires banks to assess their contracts against numerous requirements. The AI agent automates the process by consolidating various contract versions into a final document and evaluating it against DORA&#8217;s stipulations, thereby reducing manual effort and enhancing accuracy.</p><p>Petr emphasises the importance of reliability engineering in developing such AI agents, ensuring they perform consistently and effectively in critical applications like regulatory compliance.</p>]]></content:encoded></item><item><title><![CDATA[An Experiment - Top Data Trends for 2025 with Coalesce and Google NotebookLLM]]></title><description><![CDATA[AgileData Podcast #55]]></description><link>https://agiledata.info/p/an-experiment-top-data-trends-for</link><guid isPermaLink="false">https://agiledata.info/p/an-experiment-top-data-trends-for</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Thu, 12 Dec 2024 02:13:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b23cc283-1c78-4081-8154-2c70a5df320c_5703x5922.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join two LLM generated guests as they discuss the Top Data Trends of 2025 Whitepaper published by Coalesce.</p><div><hr></div><p>This is a different episode. Instead of a human guest, we have two robot guests.</p><p>I decided to try and experiment. 
]]></content:encoded></item><item><title><![CDATA[An Experiment - Top Data Trends for 2025 with Coalesce and Google NotebookLLM]]></title><description><![CDATA[AgileData Podcast #55]]></description><link>https://agiledata.info/p/an-experiment-top-data-trends-for</link><guid isPermaLink="false">https://agiledata.info/p/an-experiment-top-data-trends-for</guid><dc:creator><![CDATA[Shagility]]></dc:creator><pubDate>Thu, 12 Dec 2024 02:13:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b23cc283-1c78-4081-8154-2c70a5df320c_5703x5922.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Join two LLM-generated guests as they discuss the Top Data Trends for 2025 whitepaper published by Coalesce.</p><div><hr></div><p>This is a different episode. Instead of a human guest, we have two robot guests.</p><p>I decided to try an experiment. My experiment was: can I upload a white paper to an LLM, have it generate a podcast, listen to that podcast on my daily walk, and see whether that summary removes the need for me to actually read the white paper?</p><p>So in this case, I have grabbed a white paper called Top Data Trends for 2025 from Coalesce, uploaded it to Google NotebookLM and got it to generate a podcast with two hosts chatting about the white paper.</p><p>Have a listen and let me know what you think.</p><p>I'm really keen to understand: do you think this approach is useful, or is it a load of bollocks?</p><p></p><blockquote><h2>Listen</h2></blockquote><p>Listen on all good podcast hosts or over at:</p><p><a href="https://agiledata.podbean.com/e/agiledata-55-an-experiment-top-data-trends-for-2025-with-coalesce-and-google-notebooklm/">https://agiledata.podbean.com/e/agiledata-55-an-experiment-top-data-trends-for-2025-with-coalesce-and-google-notebooklm/</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://agiledata.podbean.com/e/agiledata-55-an-experiment-top-data-trends-for-2025-with-coalesce-and-google-notebooklm/&quot;,&quot;text&quot;:&quot;Listen to AgileData Podcast Episode&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://agiledata.podbean.com/e/agiledata-55-an-experiment-top-data-trends-for-2025-with-coalesce-and-google-notebooklm/"><span>Listen to AgileData Podcast Episode</span></a></p><div id="youtube2-QuZdfuHPn6E" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;QuZdfuHPn6E&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/QuZdfuHPn6E?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p></p><blockquote><h2>Read</h2></blockquote><p>Read the podcast transcript at:<br><br><a href="https://agiledata.io/podcast/agiledata-podcast/an-experiment-top-data-trends-for-2025-with-coalesce-and-google-notebooklm/#read">https://agiledata.io/podcast/agiledata-podcast/an-experiment-top-data-trends-for-2025-with-coalesce-and-google-notebooklm/#read</a></p><p></p><blockquote><h2>Download</h2></blockquote><p>You can download the whitepaper from the Coalesce website here:<br><br><a href="https://coalesce.io/reports/the-top-data-trends-for-2025/">https://coalesce.io/reports/the-top-data-trends-for-2025/</a></p><p></p><blockquote><h2>Google NotebookLM Briefing</h2></blockquote><p>(How meta is that: an LLM commenting on a podcast created by the same LLM.)</p><p><strong>Briefing Document: Top Data Trends for 2025 - Based on Coalesce Report</strong></p><p><strong>Introduction:</strong></p><p>This briefing document summarises the key data trends for 2025, as discussed in a podcast episode featuring AI-generated hosts ("ADI #1" and "ADI #2") dissecting the "Top Data Trends for 2025" report by Coalesce. The podcast format was used as an experiment to see if an LLM summary could remove the need to read the whitepaper.</p><p><strong>Key Themes and Ideas:</strong></p><p><strong>The Rise of Knowledge Pipelines:</strong></p><ul><li><p>Traditional data pipelines are evolving into "knowledge pipelines."
This isn't just about moving data; it's about enabling AI to <em>reason</em> with the data, understand context, and learn from it.</p></li><li><p>&#8220;It&#8217;s not just tables and columns, it&#8217;s about AI. Understanding all the relationships and the context. Hidden within the data itself. So think about it. Knowledge pipelines, they teach AI to reason about the data, not just process it."</p></li><li><p>This is crucial for generative AI, allowing it to learn and make smarter decisions.</p></li></ul><p><strong>Multi-Engine Compute:</strong></p><ul><li><p>The idea of having all data in one place is being challenged. Different compute engines are better suited for different tasks (real-time analytics vs. complex machine learning).</p></li><li><p>A unified storage layer is essential for a multi-engine approach, allowing for flexibility and the ability to "pick the right tool for the job."</p></li><li><p>Even big players like Snowflake are embracing more flexible, open source approaches such as Iceberg tables.</p></li><li><p>"The future is going to be about picking the right tool for the job."</p></li></ul><p><strong>Practical AI Takes Center Stage:</strong></p><ul><li><p>2025 will be about demonstrating real-world ROI and business value with AI, moving past the "hype" of 2024.</p></li><li><p>"2024 was the year of bold AI experiments, but 2025 is going to be all about ROI and tangible business value."</p></li><li><p>AI startups that cannot demonstrate real value may struggle.</p></li></ul><p><strong>Combating Industry Amnesia with AI:</strong></p><ul><li><p>There's a tendency in tech to reinvent the wheel, forgetting lessons from the past.</p></li><li><p>AI agents can help by capturing and analyzing vast amounts of historical data, filling in knowledge gaps.</p></li><li><p>"It&#8217;s like using AI to fight AI induced amnesia. 
These agents can help us learn from the past&#8230;"</p></li></ul><p><strong>The Importance of AI Governance:</strong></p><ul><li><p>AI needs to be used thoughtfully and responsibly, with clear guidelines for development, deployment, and use.</p></li><li><p>AI governance should align with company values and ensure AI is serving people, not the other way around.</p></li><li><p>&#8220;We need to be really thoughtful about why we&#8217;re using AI, and what the potential consequences might be."</p></li></ul><p><strong>Data Quality and Culture are Key:</strong></p><ul><li><p>Data quality isn't just a technical issue; it's a cultural one.</p></li><li><p>"If data literacy isn&#8217;t valued and prioritised throughout your entire organisation, AI initiatives are going to struggle."</p></li><li><p>If people don't trust the data or understand how to use it, even the most sophisticated AI will fail.</p></li></ul><p><strong>Scaling AI Deployments:</strong></p><ul><li><p>The focus shifts to real-world AI deployments at scale.</p></li><li><p>This brings challenges in terms of MLOps (managing machine learning models) and AIOps (using AI to automate IT operations).</p></li><li><p>Data teams become even more important in this environment, acting as a bridge between AI hype and business value.</p></li></ul><p><strong>Open Table Formats and Flexibility:</strong></p><ul><li><p>Open table formats like Iceberg become essential in data lakes, providing flexibility and avoiding vendor lock-in.</p></li><li><p>It's about having the "freedom to use different tools without having to constantly worry about those compatibility issues."</p></li></ul><p><strong>Unlocking the Potential of Unstructured Data:</strong></p><ul><li><p>AI is changing the game in terms of analyzing unstructured data (audio, video, text documents).</p></li><li><p>Companies are finding ways to structure this data for new AI-driven insights. For example, an insurance company used AI to transcribe customer calls and then rate them, converting those recordings into structured data.</p></li><li><p>"Companies are now finding ways to structure this previously untapped data. Opening up these incredible new possibilities for AI driven insights."</p></li></ul><p><strong>Platform Gravity:</strong></p><ul><li><p>Platforms like Snowflake are becoming "centers of gravity" not just for data, but for applications, AI, and decision-making.</p></li><li><p>This might create vendor lock-in issues, but it also makes it simpler to have data and tools in one place for efficiency.</p></li></ul><p><strong>Data as a Product:</strong></p><ul><li><p>Companies are starting to treat data as a product, focusing on quality, reusability, and alignment with business needs.</p></li><li><p>This leads to companies building purpose-built data platforms and potentially moving away from large, expensive systems.</p></li><li><p>&#8220;People want trust. Building something reliable and actionable.
That you have confidence in means providing that transparency for all users.&#8221;</p></li></ul><p><strong>Semi-Structured Data Revolution:</strong></p><ul><li><p>It's estimated that 90% of the world's data is semi-structured, which highlights the huge potential to be unlocked.</p></li><li><p>Businesses will need to rethink their practices around storage, security and governance to handle this influx of data.</p></li><li><p>Combining structured data with semi-structured data can "unlock some seriously game changing insights."</p></li><li><p>For example, combining call center recordings with customer sales data to understand how agent empathy impacts retention, or combining satellite images with insurance claims.</p></li></ul><p><strong>The Human Element is Key:</strong></p><ul><li><p>Technology is just one piece of the puzzle.</p></li><li><p>Businesses need to combine technical advancements with a shift in mindset about data.</p></li><li><p>There's a need to close the gap between IT and the business side. It is not the technology that holds a business back, but rather the people, processes and culture.</p></li><li><p>Data literacy needs to be embraced organisation-wide, so people understand how to interpret data and turn it into action.</p></li><li><p>"Technology is never the reason why a business can&#8217;t succeed or transform. It always comes back to people."</p></li><li><p>AI governance is essential, considering the ethical implications of AI from the very beginning, being transparent about how systems work, and being inclusive and diverse in the development process.</p></li><li><p>Data teams are evolving into product teams, requiring new skills and a customer-focused approach.</p></li><li><p>"Data teams need to become more collaborative, more customer focused, and more agile in their approach."</p></li><li><p>There's a need to increase awareness about the ethical implications of AI, with honest conversations and diverse teams who can help prevent bias.</p></li></ul><p><strong>Strategic Business Implications:</strong></p><ul><li><p>Agility and adaptability are crucial for businesses to thrive in the rapidly changing landscape.</p></li><li><p>Cloud-based data platforms provide the flexibility and scalability needed to respond to changing conditions and capitalize on opportunities.</p></li><li><p>A customer-centric approach to data is essential, with a focus on creating better customer experiences.</p></li><li><p>Data privacy and transparency build trust and can be a differentiator in a competitive landscape.</p></li><li><p>Companies need a holistic data strategy that encompasses all aspects of data, from collection to action, and that is aligned with the overall business strategy rather than sitting in a silo.</p></li><li><p>Investing in strong data teams with the right skills is crucial, and fostering a culture that values data and provides opportunities for growth and development is essential.</p></li><li><p>"Data as a strategic asset. Not just a technical afterthought."</p></li></ul><p><strong>Conclusion:</strong></p><p>The podcast highlights that 2025 is set to be a pivotal year for data and AI. Companies need to move away from traditional data practices and embrace new concepts, such as knowledge pipelines, multi-engine compute, and treating data as a product. It is just as important to consider the human side and build a strong culture of data literacy and governance.
A holistic data strategy, aligned with overall business goals, will enable businesses to unlock value, drive innovation, and stay ahead of the game.</p>]]></content:encoded></item></channel></rss>