We helped USAA turn around their big data projects and create a data engineering culture
- Data problems can become such difficult people problems that soldiers are focused on problems at home instead of being shot at in a war zone.
- USAA realized the hard way that data engineers are needed for data scientists to be productive.
- A lack of data engineering culture made it difficult for USAA to execute and create ROI for their data.
Jesse Anderson: 00:02 Thank you for coming and seeing this. As he mentioned, we’re going to be talking about creating a Data Engineering culture at USAA. And here’s our itinerary. We’re going to talk a little bit about that Data Engineering culture, and then we’re going to turn it over to Tom, and Tom is going to talk about what was happening at USAA before they had that Data Engineering culture, what they’re doing now, and what they’re going to do in the future. Then we’re going to top it off with how you can start creating that Data Engineering culture within your own organization. And hopefully, we’re going to see the incredible value of doing that. So a little bit about us. My name is Jesse Anderson. I’m the managing director of Big Data Institute. We help companies on the mentorship side of things. Sometimes not everything is a technical issue. Sometimes the problems that you’re experiencing are actually management problems.
Tom Goolsby: 00:49 Hello. Uh, I’m Tom Goolsby, and I’m not sure how many people know about USAA, but USAA is a financial services company primarily focused on military members and their families. It’s about 97 years old. It’s about 30 billion in net worth. And we have about 34,000 employees. And as I think about this, looking back, I think about our focus on the military members and their families and more so really this alignment to really understand the military member and what they’re going through. And with that, I wanted to just kind of help people understand, it’s not just words on paper. Sometimes there’s a mission statement that people have as far as the company goes. But with USAA it’s involved in everything that we do, and in the decision-making the thinking is: is this in the best interest of our members?
Tom Goolsby: 01:41 One story that I think of is at an employee meeting, we listened to phone calls with our members. And, really to kind of help people understand this, this phone call that was played was a member on the front lines in Afghanistan. It was a Major and she was calling and she was really upset. She had set things up for her mortgage to be paid through another institution and they weren’t getting the payments and a few months had gone by and then she receives this notice and the way the mail was working between the United States and where she was in a war zone, she was finding out that her home was getting ready to be foreclosed on. So she called USAA, her last resort: what can happen? Can anybody help me? And you can hear the concern in the person’s voice, and you can hear the empathy in the customer service rep’s voice as well.
Tom Goolsby: 02:34 And this person took over. He really, really just went overboard and made sure that she felt like she was going to be taken care of. He left the phone call and he went and took care of everything, then got back on the phone call and just really helped her to understand: we took care of it. We were able to get in contact with the institution. We were able to take over what needed to be done. We were able to make these payments and get you caught up and make sure that this doesn’t happen again. And you know, in this sense, and then what’s happening with the employee base at USAA. And you really get kind of this feeling like, okay, this is some of why we’re here and what we’re doing. And I bring that up just to kind of set the tone of like, it’s not just words on paper, you know, a lot of what we’re trying to do. So what enables all this? The data, the data we collect.
Jesse Anderson: 03:25 So let’s talk a little bit more about that data and how that Data Engineering culture happens. One thing I will say, it’s crazy that you’re in a war zone and your biggest thing is not being shot at, it’s that you’re about to get foreclosed on. And that’s really what this brings to light about our data: our data affects people’s lives. And that said, we have some really sobering statistics on that. This is a Gartner statistic. This isn’t just mine. 85% of big data projects fail to get into production. This is a really sobering thing. You’ve probably been through this entire conference. Maybe this is your first conference at Strata. This is really one of those things where you think, oh, oh, okay, I’ve heard everybody being successful; well, what made them successful? And I don’t want to appear to be a wet blanket on this; those companies that do get their data into production create incredible value there. And as we’ll talk about later, Tom is going to talk about some of that value that USAA has received from that. So what is a Data Engineering culture? A Data Engineering culture is where the value and importance of data is perceived and understood at all levels of the organization. This is really key and important because sometimes in organizations, it’s just the individual contributors that realize this value. That isn’t the way it should be. It should be at all levels. It should be at the C level to the mid-levels to the VPs. This is because they need to understand this from the management point of view. Management needs to be fully behind this data culture. One of the manifestations we’ll actually talk about with USAA is getting the right ratio of Data Engineers to Data Scientists.
Jesse Anderson: 05:04 One very common issue here is many Data Scientists to zero Data Engineers. That’s when you have a really big problem. You need to get that ratio right. Usually, that’s actually inverse. It’s usually one Data Scientist to two to five Data Engineers. Well, why is that? Well, let’s talk about that right now. You may have seen this paper from Google. It’s actually really interesting, and it’s showing that when people start to think and talk about machine learning, they think that machine learning is the big problem. And from this paper and from my own personal experiences, the issue of creating machine learning and putting that into production is not the biggest problem or the most time-consuming problem. If you can see it in the back, ML Code is by far the smallest box.
Jesse Anderson: 05:57 What we have are the issues around that. These are generally Data Engineering issues and that’s where the vast majority of our time is spent. If we were to look at it without having looked at something like this or a talk like this, everybody thinks the vast majority of what we need are Data Scientists; what we actually need are Data Engineers to facilitate those Data Scientists. When I talk about this, with management especially, I say: Data Scientists are consumers of data products. Data Engineers are creators of data products. This is a very key and important insight, and until an organization realizes this, they won’t understand why their Data Scientists are stuck, and we’re going to talk a little bit about this later with Tom. So companies doing this without this data engineering culture, they’re going to get stuck. They’re going to fail, and this getting stuck may not be something that you actually realize.
Jesse Anderson: 06:52 I’ve talked to a lot of companies where they’re thinking, oh wow, we’re really going, we’re really moving with this big data thing. The reality actually is that they didn’t do this, that they’re actually underperforming. They’re doing hello world level things, thinking that they’re advanced. The reality is that they’re doing something really simple because the team got stuck at that level. Their Data Scientists were the ones pushing that data engineering forward instead of having Data Engineers pushing that forward. As you go to conferences and you go to talks, oftentimes companies will just kind of assume that there’s a Data Scientist or that there’s a data engineering culture there. So as they present that, you’ve probably heard companies like Lyft and Uber talking about this. It’s kind of assumed by them. It’s crazy to them that there’s a lack of data engineering at these other companies; they just assume it. And so when they’re talking, they don’t say, oh, I’d like a tip of the hat to the data engineering culture or the Data Engineers that made this possible. They don’t specifically say that, it’s kind of assumed. And that’s really, really, really important. So now we’ll turn it back over to Tom, he’s going to talk about some of the history of USAA with that data engineering culture.
Tom Goolsby: 08:01 So I’ll, I’ll go back to what brought me to USAA. So I started at USAA about 10 years ago, and I was originally brought in as a part of the data warehousing team and the data mart team. And that was really about building out data as an asset and how do you get that data into information; so the information layer. And you know, just to kind of give that background of me and my involvement. So then step forward about five years and what we were looking at then is, I was given about eight teams of Data Scientists and these were really smart people. And our focus then was how do we better detect and predict life events from all of our data, transactional data, marketing data, demographic data, all the data that we could possibly put together. And, well, after about two years, we didn’t have a lot of findings.
Tom Goolsby: 08:50 On some of these things we had to kind of step back, as we found out some of the data that was structured the way we had it wasn’t going to help us in the future. And so, as these things started occurring about two years ago, some of what happened is, my area is a business area, and we were able to take over our core data set for interactions; and so all of our interaction data, as a business unit, we took it over from IT. And I was like, how does that happen? And it’s like, well, the business area, what we were able to say and provide was, we’re the area that’s really held accountable. Whenever there’s a problem or a finding and something’s needed with this dataset, they call us, they typically don’t call IT, or IT then comes to us.
Tom Goolsby: 09:40 And so, here we became the owner of this data and I have Data Scientists, Data Analysts and Data Engineers now. I didn’t at the time. And we were really trying to accomplish and figure out things, and I kept coming into this gap of: we’re trying to solve this, I need people to help us, what’s missing? And what I, two years ago, really got focused on was: we need to get involved with something like O’Reilly and see what it is we are missing. And this is where a prime example of data engineering kind of reared its head. And from that, we had some really good examples and started working with Jesse Anderson actually, and he was able to come in and show us like, here’s what’s missing, here’s the difference between your DBA and the person that should be helping you from an advanced programming perspective.
Tom Goolsby: 10:36 And I had a Data Scientist who was trying to do all these things for me two and a half years ago. And he’s really smart, a physicist: physics bachelor’s, physics master’s degree, another master’s degree in data and analytics. So from a math perspective, he’s really strong. But on the advanced programming side, that was something that was missing and he was having to try and figure it out, because what he needed was somebody to actually provide him with the data in the format that he needed, in the way that he needed it. And so this has continuously been something that we’ve been struggling against. So moving forward, last year, briefing the University of Texas at San Antonio, we have a really good partnership with them. And at the end of the day of this “sharing session” with them, I’m closing the meeting out and I start telling them this story about what we feel like is missing, this advanced programming skillset.
Tom Goolsby: 11:47 Coming out of the school, we’ve created pipelines of Decision Science Analysts, Data Analysts, Data Scientists. But from a Data Engineering perspective, there wasn’t a curriculum that somebody could go through at a university, a college, a school, that puts somebody out into the real world who is all of a sudden a junior Data Engineering professional that could be applying for jobs. And we didn’t even have positions that were being posted to kind of identify that. So that’s kind of what started happening over the last year now: meeting with HR, explaining this gap, helping them come to a definition of what a Data Engineer is. What’s the difference between a Data Engineer and a DBA? What’s the difference between a Data Engineer and a Data Scientist? What’s this term machine learning engineer that we’re seeing coming out of the Bay Area? And getting IT’s involvement with this as well.
Tom Goolsby: 12:41 So it’s the combination of things and from the business, whenever we come to work with IT and collaborate with HR, we’re helping drive that understanding and that value proposition so that everybody understands the value. And it’s not just here are some words on paper. How’s that?
Jesse Anderson: 12:58 Yeah, keep on going.
Tom Goolsby: 12:58 Okay. So with that, moving forward now what’s the impact? Where, where are we seeing this go? So now we’re seeing more people with data engineering skills. We engaged with Jesse; he came in and he’s trained over a hundred people now on data streaming technology, data engineering technology, skilling up in Kafka and other related skills that were, at times, a Google or a YouTube search, and, with that, we’ve really gained an acceleration. In this, what I’m searching for with partners and the people that we work with is: how do we find this gap, how do we accelerate on this gap, how do we accelerate from where we are to where we need to be?
Tom Goolsby: 13:53 So that’s really the biggest thing there. Even with UTSA, they came back a month later and said, we reached out to our partners at other universities. The people they knew at Stanford, the people they knew at Google, Amazon, Facebook, and overwhelmingly they said yes, this needs to be something that’s done. So they’ve started working on a Data Engineering curriculum as a specialization. So a master’s of data and analytics with specializations now: data science and Data Engineering. And even now what we’re dealing with is some governance.
Tom Goolsby: 14:31 Where does this go now? So really now, it’s more of that understanding of data engineering, more of the alignment with having Data Engineers, not just on the IT side, but even having some of that matrixed over to the business side to have some Data Engineers working directly with Data Scientists. This is where I feel like the biggest workflow stoppages or delays occur. And what I would say to you guys is, you’re not alone in this. If you’re a manager of a data science team and it feels like you’re not having data science deliveries that are happening on a 30, 60, 90-day process, it may be a Data Engineering gap that’s really causing that delay.
Jesse Anderson: 15:22 So could you talk a little bit more about that time when we were talking to some of your Data Scientists about their impediments?
Tom Goolsby: 15:29 Yes. So one of the things that we did is we went and we interviewed the Data Scientists and we interviewed people who were trying to be in roles of a Data Engineer or should be in a role of a Data Engineer. And a real aha moment for us was, in our case, the Data Scientists were given a lot of autonomy and they were given a lot of ability to self-direct their work. At times, while they might’ve been given a priority, some of the time those priorities were shifted because they weren’t able to get the data, or they weren’t able to get the data in the format that they needed. So then that task, of all the tasks that they had, would be shifted to the back and it becomes a parking lot item that they may not get back to because they have so many other things in the queue that they can or hope to work on.
Jesse Anderson: 16:21 And was there any kind of estimate of business ROI that you found there?
Tom Goolsby: 16:26 Yeah and this was kind of amazing. The return on this investment was more than my budget over several years. So it was quite high, what we would see as the impact of not doing some of these tasks, as we pull it together and raise awareness subsequently that hey, these are our priorities. How do we get the right people involved to ensure that the Data Engineering work is done and that they’re provided the data in the way that they need it? So over and over again, the more complex the problem was, the more it seemed like there was a need, not just to hire more Data Scientists, but to actually have more of a data engineering approach to provide data in the way that they could use it.
Jesse Anderson: 17:11 So that definitely was one of the issues. We saw that, and I see this with other organizations: they’ll think that the answer to a data science problem, of why they haven’t gotten anything out of it or why it’s been a year without any ROI, is to add more Data Scientists. And could you talk more about some of the things that we saw there specifically? What did it look like when the Data Scientists hit that dead end?
Tom Goolsby: 17:36 Yeah in many cases the Data Scientists just didn’t work on that task.
Jesse Anderson: 17:42 Or they would work on it up to a point and then stop.
Tom Goolsby: 17:45 Yes, that’s exactly it. I’m not sure what everybody else’s experience in the room has been, but as I’ve talked to other people and raised that awareness, I’ve had several people go back to their groups and their teams and then contact me later and say, yeah, we, we found the same things. So in this there’s understanding the complexity, the need for the data, and that in some cases it’s even access to the data in the environment where the data science work has been situated. So I’ve seen multiple layers of that occur. We also tried to set up different environments for Data Scientists to work in, and then actually getting the data from another environment into their environment has been tricky. And not just the data, but the data the way they need it.
Jesse Anderson: 18:33 Yeah and we’ve also seen that with data the way they need it. Also, just the simple exploration for them is difficult. And would you like to talk more about that exploration of the data as well?
Tom Goolsby: 18:47 So the exploration of the data for them can be tricky also because at times their approach to the exploration may be to use what they know. For example, using Spark instead of a SQL database for something small. And here it’s depending on what they know. They tend to go after things and solve things on the data science side with the knowledge that they have. And it may not always be the best-suited thing to solve it. And what may run fast in Spark on large data may not run as well as a SQL server, for example, when the data is smaller in scale.
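As a rough illustration of the small-data point, here is a minimal sketch (not anything shown in the talk) of answering a simple count question with a plain SQL database, using Python’s built-in `sqlite3` module. The `interactions` table, its columns, and the sample rows are invented for this example; the assumption is only that the data set is small enough to sit on one machine.

```python
import sqlite3

# Hypothetical, tiny interaction data set; names and rows are made up for illustration.
rows = [
    ("m1", "phone"),
    ("m1", "web"),
    ("m2", "phone"),
    ("m3", "mobile"),
    ("m3", "phone"),
]


def count_by_channel(rows):
    """Load the rows into an in-memory SQLite table and count interactions per channel."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE interactions (member_id TEXT, channel TEXT)")
        conn.executemany("INSERT INTO interactions VALUES (?, ?)", rows)
        cur = conn.execute(
            "SELECT channel, COUNT(*) FROM interactions "
            "GROUP BY channel ORDER BY channel"
        )
        return cur.fetchall()
    finally:
        conn.close()


channel_counts = count_by_channel(rows)
print(channel_counts)
```

For a data set of this size there is no cluster to start and no job to schedule; a distributed engine like Spark earns its overhead only once the data no longer fits comfortably in a single database.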
Jesse Anderson: 19:32 Yeah. I think that was one of the biggest learnings that I’ve tried to note to other leaders. Your Data Scientists know a tool named Spark, and it becomes this hammer with which they hit every single problem. Now when you’re a Data Engineer you say, but you’re using the wrong thing. They just don’t know the right tool for the job. And that’s really what’s key here: the Data Engineers are there to help the Data Scientists, not to be a finger-wagging sort of thing. They’re there to say, Data Scientists, do this, not that, and give them the tools and the resources and the infrastructure to do this right. Otherwise, they really bang their head against the wall on this. While we talked a little bit about the ROI of them hitting a wall, there’s this whole other aspect of ROI of them toiling. Would you like to talk about some of that toiling that we saw of people spending, not at USAA but at other clients? I’ve seen them toil for months. So there are months of a Data Scientist’s time just spent trying to figure out a Data Engineering problem that would have taken a Data Engineer a day or two. It’s really to that level of toil.
Tom Goolsby: 20:45 Yeah. And here, part of this is there’s so much more. There was a picture up there previously and it had all the boxes, and the small box was the machine learning box. And what I’ve really tried to do is figure out how do I enable, how do I switch the Pareto effect, really. Because what happens here is, if you think of this as the outside boxes being 80% and the 20% is that ML side, I want to switch it to where my Data Scientist gets to do 80% of their work in the ML and AI space and the deep learning space, and really have the Data Engineer absorb 20% of that big box work. And with that, that’s where I feel like we’re accelerating the most: by taking the work that they’re not uniquely situated for, because they haven’t had that advanced programming skill set and they haven’t been working in that space. They’ve been working more on the math side, and it’s not to say that it’s a generalist skill set that everybody should have. But whenever I have a focused person that can really go in there with advanced programming skills and get them exactly what they need, and have things waiting for them even to where they can go from thing to thing, that to me is the ideal situation: to provide what they need and have that cycle go back and forth.
Jesse Anderson: 22:17 And there’s a whole other aspect to this toiling. It’s not just that the company loses money, it’s actually that it completely goes against the grain of what a Data Scientist came to do. I work with a lot of different companies, and the same story is true. If you go to your Data Scientists and you say, what do you think you were hired for? What are you doing now? What they will say is, I came to do that small box. I came to write that machine learning code. What are you doing now? I’m doing all these other boxes and I’m really frustrated that I’m doing all these other boxes. And that frustration, depending on how much of a Data Engineering culture there is, will result in a few things. In many cases, in the worst case, they’ll actually quit. It’s, this is boring to me. This wasn’t what I was hired to do. You guys told me one thing and did another and so now I’m going to quit. And these Data Scientists are difficult to hire. They’re difficult to recruit and now you have to recruit that person again. Or in some cases, large parts of teams will quit because they said, this isn’t what I wanted to do.
Tom Goolsby: 23:25 Yes. Yeah, I’ve seen that. Exactly. Uh, I go back two and a half years ago before I realized this was an issue. And, and I, you know, hired this person, brought him in and I asked him, I was like, let’s go. Let’s go down this path, please figure this out. I’m like, this is a really smart person, you’ll be able to get this. And you know, he did. But when I look back at the skillsets the Data Engineers that I work with now provide, it took him weeks to kind of dig in and really figure this out, whereas the Data Engineer would have just provided it and just had that skill set and been able to do that for him. So there are many ways this kind of balances back and forth, having that Data Engineering skill set and the data science skill set work together. And really here it’s the, uh, the acceleration and accomplishment that I’m focused on, on getting impact and outcomes for the specific tasks that we have. And in every scenario, whenever there’s something that’s taking a while to accomplish, it comes back to there’s something missing on the Data Engineering side.
Jesse Anderson: 24:31 So you mentioned before, one of the things you want to share is you’re not alone. That’s part of what I do as well. Are there any other kind of sharings that you want to give before we start talking about other things? Or any other learnings in particular?
Tom Goolsby: 24:45 These were the big ones. So the “you’re not alone” aspect is that you’re not the only ones that are dealing with this. I’ve talked to many different people in many different companies and these are Fortune 100 companies and even small companies. And interestingly at the smaller companies, they’ll have like one person and that one person is the Data Scientist, the Data Engineer, the person that’s trying to get the data, situate the data, then also do something with the data, and they’re really struggling. And then on the larger side, the teams may be set apart and not working together or understanding the value proposition of that work together. So in my discussions with architects and IT and our other partners, I really try to help them understand the value proposition of what we’re trying to accomplish. And then they really start to understand why, and why something needs to be a certain way. And it’s getting that across and helping people understand. It’s like, okay, I understand now. I didn’t understand in the beginning there’s a reason why this data needs to be situated a certain way and it’s not just up to you to figure it out.
Jesse Anderson: 26:03 So I think we’ll have time later on for questions, so do start thinking about your questions for both of us. This is one of those rare opportunities to ask some long-term sorts of questions like this, but we’ll be getting to that at the end. So now let’s talk about, we’ve learned what a Data Engineering culture is, we talked about USAA’s progression through that. Now let’s talk about how you should create your own. First of those is what should your Data Engineering team look like? This is actually a really common question. What are the sorts of skills that the Data Engineering team needs? This comes out of my book. It’s called Data Engineering Teams, it’s an entire book that I wrote about, “This is what your Data Engineering team should do, this is how it should act, and these are the skills that should be part of that.”
Jesse Anderson: 26:48 One is that you need distributed systems. You heard Tom talk a little bit about DBAs? So DBA is kind of a shorthand that we use for SQL-focused people. People who do not know how to program, but they can write SQL code; sometimes their titles are DBA, SQL developer, ETL developer. Do those people make for good Data Engineers? The answer is they shouldn’t be part of your core team. In other words, if you have a two-person team, two of those people should not be ETL Engineers. If you have a much bigger team, you may have the need for them, but they’re not going to have these two critical skills. One is distributed systems.
Jesse Anderson: 27:30 Distributed systems, this is what we’re doing with Hadoop, this is what we’re doing with Spark. It’s how do you distribute out a task, and your average Software Engineer, even your average DBA won’t be able to understand these sorts of concepts. It’s really, really key. And another one is programming. And I list programming second because programming is important. If your only tool is SQL, then you lack the ability to choose the right tool for the job. That is a core, key, important thing that your team needs to be able to do. Yes, sometimes SQL is really the right way to do it and I wholeheartedly say, hey, you use SQL to do that join because it’s far more efficient. But there’s a lot of data engineering work that goes into getting to that point where you can run a SQL statement and that’s the key. That’s the thing that’s missing: that whole lot of data engineering work that gets it to that point where it’s ready to run SQL on.
Jesse Anderson: 28:25 They also need analysis. Now, this analysis isn’t to the level that a Data Scientist’s would be. This may be a very simple analysis like counts and that sort of thing. We also have visual and verbal communication. This is important because oftentimes the Data Engineers are creating either dashboards or custom dashboards programmatically, and maybe it’s something that Tableau can’t do, so you’ll need to do it programmatically and that’s maybe an issue or maybe something for your data engineering team. Also, you need to be able to talk to them. You need to be able to go up to your Data Engineer and say, hey, what does that data set look like? What does the schema look like? They need to be able to speak well enough and be able to communicate well enough because they’re the hub of your organization. As you create that data culture, this is how people are going to come and talk to you.
Jesse Anderson: 29:14 They’re going to come into your data engineering organization to figure out what that data is and they need to be able to communicate well on that. You also need a project veteran. This is a person that is there to keep you from doing stupid things. They’re not there to be a finger-wagger. They’re not there to say when I was your age sort of thing. They’re actually there to come in and say, don’t do that, do this. And the business value of don’t do that, do this can be 5 million, can be a million, it could be 10 million. I’ve had some really bad ideas from people who are beginners, so you know who the worst offenders of distributed systems are? They’re people who are beginners to it. You need that project veteran there to prevent you from doing stupid things that are really costly. You may have seen error bands before on graphs and such.
Jesse Anderson: 30:01 Well, with the error bands on small data, if you really mess up, you can rewrite that. It’ll be painful, but you can rewrite it. But the error bands on big data are really big, where if you do something really, really dumb, it can take a year just to start to rewrite that and redo that. And I’ve experienced that firsthand working with teams. There’s also the importance of schema. You need to be able to share data. No longer are you exposing data as, here’s this REST API that you call; now data is exposed as a data product and you need to have the rest of your organization deal with that in a different way. You need people on your team who understand that, and if you don’t have people who understand that there won’t be any usage of your data product.
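One way to picture the schema point is a small, hypothetical sketch in which the shared data set has an explicit schema that every record is checked against before it is published. Nothing here comes from the talk: the `InteractionRecord` type, its fields, and the `validate` helper are invented for illustration, and a real team would more likely rely on a schema format or registry than on hand-written checks.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class InteractionRecord:
    """Hypothetical schema for one row of a shared data product."""
    member_id: str
    channel: str
    duration_seconds: int


def validate(record: dict) -> InteractionRecord:
    """Check a raw dict against the schema; raise if it does not conform."""
    expected = {f.name: f.type for f in fields(InteractionRecord)}
    missing = expected.keys() - record.keys()
    extra = record.keys() - expected.keys()
    if missing or extra:
        raise ValueError(
            f"schema mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    for name, typ in expected.items():
        if not isinstance(record[name], typ):
            raise TypeError(f"{name} should be {typ.__name__}")
    return InteractionRecord(**record)


good = validate({"member_id": "m1", "channel": "phone", "duration_seconds": 240})
print(good)
```

The design point is that consumers of the data product can depend on the field names and types without asking the producing team each time, which is the kind of sharing the schema discussion is about.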
Jesse Anderson: 30:46 And finally your domain knowledge. Your programmers actually need to do that.
Jesse Anderson: 30:55 Something you want to add? No?
Tom Goolsby: 30:55 Go ahead.
Jesse Anderson: 30:55 Ok, so they also need to be multidisciplinary. On the team, if you were to look at the titles of a team, a team is majority Data Engineer and that Data Engineer, just to reiterate the definition of that is a Software Engineer who has specialized in big data. That will be your predominant title within the team. That said, that may not be the only title. You may actually have a DBA on that team. Note one DBA, not all DBA’s. This is very key because Software Engineers, we’re really good at the top things, but that schema thing, we’re not very good at. And so you may need somebody on the team that’s there to say, oh, you need that data lineage.
Jesse Anderson: 31:36 You need data governance and you need schema. There are questions that I ask when I work with teams and the DBAs are the ones that know this. And this is really what’s key. So it really is important; a lot of you are managers, might be team leads, maybe architects in an organization. There really is a need for help. So it’s important to take an honest evaluation of the team. Does this sound like your team? Do they have the skills? Do they have the abilities? Do you understand your use case? Did you give them the resources? These are the sort of things that kill me when I see this because I see this too often. Hey, we have this big data project. Okay, what resources did you get? We didn’t get anything. We don’t even have a Hadoop cluster or something like that.
Jesse Anderson: 32:19 This kills me. Give them the resources. Resources mean a cluster to run big data things on. You can’t set your team up for failure like this. This is completely unfair to them. Please don’t do that. You also need to understand if they have a skills gap or an ability gap, that ability gap. Please read that. It’s a term I coined having worked with a lot of individuals on teams. Some individuals on a team on their best day cannot do this and that is an important realization and honest realization that you need to make. Their time is better spent elsewhere. So the effects as we just talked about, it’s going to maximize your ROI. As Tom was just mentioning if you are a year into your big data journey and you have nothing to show for it, more than likely there’s a need to go back to your foundation and figure that out.
Jesse Anderson: 33:11 More than likely there’s an issue with your data engineering organization or the lack thereof. It also prevents us from really wasting time and money, and it gets our big data projects unstuck. And this may be what you’re seeing. You may not know the words to say it, but you’re kind of thinking, “Our data science team, I go to this conference and I hear them doing stuff that isn’t us. What we need to do is we need to be more like this. What are they doing that’s different?” More than likely it’s a data engineering culture that has gotten them unstuck.
Jesse Anderson: 33:45 Well thank you. Let’s go ahead and open it up to questions.
Tom Goolsby: 33:50 In the back, please.
Audience Question: 33:55 You had mentioned organizationally having two or three Data Engineers per Data Scientist. Is that the way your organizational design is structured? Is it self-encapsulated teams where you’ve got a Data Scientist who is directing Data Engineers and Data Analysts? Or are those teams shared? Is that functionality shared across?
Jesse Anderson: 34:26 I’ll give you my answer and then I’ll turn it over to Tom. The answer is, it will depend. It depends on the organization. Oftentimes your Data Engineering came out of IT and your data science came out of analytics. There are times when they’re better together, where they’re more efficient. And sometimes at larger organizations, they’re actually more efficient as part of a business unit. Kind of like when Tom was talking about, we had the mandate, we had the problem, but we didn’t have the resources to fix it. That’s the time when I say yes, you probably need to do that. For other organizations, they won’t have enough Data Engineers, so it’s better to keep them together. And then as you grow that data engineering culture, then you might start putting them into business units.
Tom Goolsby: 35:12 Yeah. And so on our side, the Data Engineering skill set and the things that we’re trying to accomplish, it ends up overlapping. So there are so many things that need a data engineering solution that we fill up the data engineering queue and we reach out to IT help for the people that aren’t actually on the team. And I’ll have people that are matrixed, but then also our Data Scientists end up still doing a little bit more because we don’t have more of a ratio like what Jesse was talking about. Now we’re still working towards that end, but in the meantime, you still have to figure this out.
Jesse Anderson: 35:54 Go ahead.
Audience Question: 35:55 Hi. Can you talk a little bit about how the handoff occurs between Data Engineers and Data Scientists? And especially in an agile world, what has been your experience?
Tom Goolsby: 36:12 So, uh, depending on which framework you’re using, we use SAFe and, depending on which project, it’s either a two or four-week iteration for us, and with our teams, largely what I see happening is they’re working side by side. And so while the Data Scientist is looking at the last thing that was given to them and trying to make that work, the Data Engineer has what’s already in his queue and he’s working on delivering the next thing. So there’s a constant backlog of work, especially from an agile perspective. I like them sitting close together, but in a distributed environment where you have people working in other states, Skype and those other things are kind of our friends. And whether using Skype or Slack or something along those lines, there’s some communication that’s going on. It’s telling them, it’s like, “Okay, this is available. Can you take a look at this?”
Jesse Anderson: 37:15 Yeah. One thing I’d mention, if you read the book, you’ll see a diagram that I have in there and it’s the data science team and Data Engineering team. There is a really high bandwidth communication link between them. If they’re not on the same team, they are really talking to each other much more closely than the rest of the organization. It brings up a thought that I didn’t point out: if you want to know what’s difficult about big data, it’s the fact that you have to go cross-organizational in a way that you never had to before. There are obviously technical issues, but you’ve never had to be so cross-organizational that Tom in the analytics department and me on the IT side are going to have to work that closely together. And that’s one of the real key difficulties you’re alluding to.
Jesse Anderson: 37:58 A question, could you grab the mic?
Tom Goolsby: 38:03 Thank you.
Audience Question: 38:05 The top skill which you mentioned for the Data Engineer, the distributed system. Can you elaborate more on that a little bit?
Jesse Anderson: 38:12 Sure. So the question was around distributed systems. If you have a system where the work is being split up amongst different computers, but it’s the same task, that’s a distributed system. And that’s how we do large scale things quickly; because instead of processing a petabyte on a single machine, we take that petabyte, we split it up and we run that on hundreds of machines. And as a direct result, we’re able to get a result much faster. That whole act of distribution is a difficult thing that teams really need to understand, and that’s an entire skill. Somebody needs to understand how we create systems that scale correctly and efficiently, and that’s important on that team.
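To make the split-it-up idea above concrete, here is a minimal sketch in Python. It is an illustration only, not something shown in the talk: the function names (`split`, `count_words`, `merge`, `distributed_word_count`) are hypothetical, and a local thread pool stands in for the hundreds of machines a real cluster would use. The shape is the same, though: partition the data, run the same task on every partition, then combine the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, num_workers):
    """Split the records into num_workers roughly equal partitions."""
    return [records[i::num_workers] for i in range(num_workers)]


def count_words(partition):
    """The same task that every worker runs on its own partition."""
    counts = {}
    for line in partition:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def merge(partials):
    """Combine the per-partition results into one final result."""
    total = {}
    for partial in partials:
        for word, n in partial.items():
            total[word] = total.get(word, 0) + n
    return total


def distributed_word_count(records, num_workers=4):
    # Assumption: threads stand in for separate machines in this sketch.
    partitions = split(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_words, partitions))
    return merge(partials)


if __name__ == "__main__":
    lines = ["big data big teams", "data engineers", "big clusters"]
    print(distributed_word_count(lines, num_workers=2))
```

On a real cluster the hard parts are the ones this sketch leaves out: moving the data to the workers, handling a worker that fails halfway through, and keeping partitions balanced. Those are the skills the answer above is pointing at.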
Jesse Anderson: 38:57 Next question.
Audience Question: 39:02 Is this on? Oh, there we go. One of the ways I interpreted your sort of definitions of a data engineering role or team is in helping to be sort of a river guide for tools, techniques and infrastructural bits that can be deployed to support people who are doing data science and analytic work. In a really siloed environment where you don’t have this kind of bounded accountability between the IT teams and the analytics teams, we need to look ahead a year and say these are the infrastructure bits we need to be able to do stuff. What do you think the first sort of two asks are? If someone’s like, here’s a $1 million check, let’s get a thing on the one-year IT roadmap that we will get the largest amount of value out of for our first three Data Engineering hires?
Jesse Anderson: 39:48 Would you like to start?
Tom Goolsby: 39:50 So this is common. One of the things, as I look back over time, is the data engineering gap has always been there, we just didn’t realize it because of the volume that we were trying to access and look at. And when we brought in a whole bunch of Data Scientists and they were looking at volume, velocity, veracity, you know, all these things, that’s when it really became a big issue. Now as far as looking at an investment moving forward, it’s understanding where you think you’re going to be in the next year, in three years, in five years, what volume you’re going to need, what your growth is, and so what capabilities you think you need now versus what you’re going to need from the growth perspective. Does that?
Audience Question: 40:43 (Inaudible)
Jesse Anderson: 40:49 So let me answer your question. The issue with data engineering is that you don’t choose a technology, you learn about the use case first. That is one of the most key things I want you to take away from this. I wrote a post on my site, if you want to read it, it’s called, “This is Useless Without Use Cases.” If you choose technology without having understood the use case properly, you will choose the wrong technology because you won’t understand the limitations of that technology relative to the use case. So in your example, I have $1 million. The first thing I do is I hire a Senior Data Engineer, and that Senior Data Engineer starts going through the company and saying, what sorts of data do we have, what sorts of things can we do?
Jesse Anderson: 41:36 Perhaps a step even before that, and as self-serving as it sounds, I do this for companies: I come in and we do a business workshop where we define, first of all, is it a big data problem? This is key and paramount. Don’t go haul off and use Spark if you don’t have big data or won’t have big data in the future. It’s not just overkill. This isn’t polishing your resume time. This is you are going to shoot yourself in the foot several times over. Please don’t do that. So it’s starting with, do you have big data? What kind of ROI? Let’s have a specific plan in place, because I’ve seen companies where they’ll spin up a cluster and they’ll say, “Tada, here’s a cluster, everybody,” and it’s not used. And why isn’t it used? Because nobody was trained, nobody had a specific goal in mind with doing it.
Jesse Anderson: 42:26 Get these goals in mind. Get people to start thinking about this. Then have your Data Engineers start thinking about this. But you notice that the first hire was not a Data Scientist? In my opinion, your first hire is a Data Engineer that gets things ready. Maybe your second, maybe your third hire is a Data Scientist, but it is not your first, because your Data Scientist is not the right one to start getting this data ready. I see it pretty often. There’s a single Data Scientist, and they will quit after six months. You have a six-month runway before they quit if you don’t give them something to work on. If you want a follow-up question, we can talk offline.
Audience Question: 43:02 Oh, I guess so. In the interest of time, I have two more questions. So if we’re already, well shooting…
Audience Question: 43:11 Jesse, I was reading, you actually wrote an article on this distinguishing between Data Scientist, Data Engineer, and Machine Learning Engineer. I know that you mentioned once that this new Machine Learning Engineer role is starting to loom heavy in your world. Can you talk a little bit about how you distinguish between those, and in USAA’s case, is the Data Engineer kind of doing double duty as a Machine Learning Engineer?
Tom Goolsby: 43:36 At USAA right now it’s a role, and if you go out there you’ll see a job posting for a Machine Learning Engineer. And so it’s the person that takes it to production. And one of the ways I distinguish this is, we’ve had some Data Scientists be really creative and create some really amazing things and then go and say, all right, take this to production. So taking something to production versus creating something really unique and fascinating involves different skill sets. And when you stop that creativity and put them into the world of productionization, I think that you’ve kind of broken something. And for me that’s where I distinguish between the Data Scientist and the Machine Learning Engineer, because I think the Machine Learning Engineer is the person that works on the production side of IT, that has that data science knowledge and the data engineering knowledge, and is putting something into production with all of the processes and rigor involved with your company.
Jesse Anderson: 44:43 Yeah. And keep an eye out. As you mentioned, I’ve created a body of work around Data Scientists and Data Engineers and started to help people understand that. The next thing that I really want to write, and it’s going to be published, is going to be talking about why a Data Scientist is not a Data Engineer, and go even deeper into that. And then the next thing I plan to write is about the Machine Learning Engineer. The issue with Data Scientists is they write crappy code; they are relative novices at programming. And I’m loath to take the code of somebody who’s never done this and say, let’s put it into production and see what happens. The Data Engineer isn’t very good at rewriting that code because they don’t understand the math side, so they may get stuff wrong. And that’s really where the key value prop of the MLE is: they understand the math, they understand the distributed systems side, and they bring that together. Do we have enough time for another question or not?
Jesse Anderson: 45:41 I want to be respectful to everybody’s time. We’ll take your question offline. I apologize. Thank you.