The Evolution of Data Science: From Statistics to Big Data & AI
0:00: Tech daily.AI, your source for technical information.
0:04: This deep dive is sponsored by StoneFly, your trusted solution provider and advisor in enterprise storage, backup, disaster recovery, hyperconverged and VMware, Hyper-V, Proxmox cluster, AI servers, and public and private cloud.
0:18: Check out stonefly.com or email your project requirements to sales at stonefly.com.
0:26: OK, so today we're diving deep into data science.
0:29: We're gonna trace, you know, how it all started and really understand the huge force it's become now.
0:34: It's sort of like getting the quick version, the road map to see how this field just, well, changed everything.
0:38: Business, research, you name it.
0:40: Exactly.
0:40: We'll look at the key milestones, the people involved.
0:43: All the tech that really pushed it forward and the roles too, right?
0:45: It's not just one job.
0:46: No, definitely not.
0:47: It's about understanding kind of the DNA of data science, how it all fits together.
0:52: Great.
0:52: So let's go back, way back.
0:54: Data science feels new, but the ideas started popping up, what, in the 1960s?
0:58: That seems surprisingly early.
0:59: It does, yeah, but people were starting to realize even then that the amount of data was growing and they needed
1:06: a new way to think about it, a specific skill set. Even though the amount of data back then must seem tiny compared to now?
1:13: Oh, absolutely tiny, almost unimaginable, the scale difference.
1:16: But the concept was emerging and there were key figures, right? Like John Tukey back in '62, he wrote The Future of Data Analysis.
1:24: What was his main point?
1:25: Tukey's big contribution, I think, was arguing that data analysis wasn't just, you know, math. He saw it as an empirical science, empirical meaning driven by the data itself, focused on getting real meaning, actual insights from real-world stuff, not just theory.
1:41: OK, so less theoretical math, more hands-on investigation. Precisely.
1:46: It kind of set the stage for how we explore data now: asking questions, iterating, digging in. Right, really using the data.
1:52: OK, then jump forward to 1974.
1:53: Peter Naur's book, Concise Survey of Computer Methods. He actually used the term data science, right?
1:55: He did, yeah, repeatedly, and he even defined it.
1:58: He said its usefulness comes from building and handling models of reality using data. Models of reality.
2:05: That's interesting.
2:06: So using data to understand the world better. Kind of like building digital mirrors, yeah, using data to create these representations we can then study and learn from.
2:13: It's a powerful idea.
2:15: Digital mirrors.
2:15: I like that.
2:16: And it wasn't just individuals, right?
2:18: Groups were forming, like the International Association for Statistical Computing, the IASC, in '77. Right.
2:25: The IASC was all about bridging gaps between traditional statistics, the new computer power that was emerging, and critically, the knowledge of experts in different fields. Domain expertise, the subject matter experts. Exactly.
2:40: Their goal was to work together to turn raw data into useful information.
2:44: And then into actual knowledge or wisdom.
2:46: It's that same mix we see in data science teams today.
2:49: That collaboration sounds key. And Tukey again in '77, Exploratory Data Analysis.
2:55: Yeah, Tukey was influential.
2:57: His later work really hammered home the idea of letting the data itself suggest the questions, the hypotheses you should test.
3:06: So don't just start with an idea and test it.
3:08: Well, not just that. He pushed for this interplay, this back and forth between exploring the data, just looking for patterns, seeing what's there, and then formally testing what you find.
3:19: A dialogue with the data.
3:21: That's a good way to put it, a constant dialogue.
3:23: OK, fast forward a bit.
3:25: Late 20th century, data starts piling up, like really piling up.
3:29: The whole big data thing starts brewing.
3:31: Oh, definitely.
3:31: By the 90s, it was becoming a serious challenge.
3:34: You had things like that 1994 BusinessWeek cover on database marketing. Companies were hoarding customer data.
3:40: And then Jacob Zahavi in '99 basically said, look, our old tools just can't handle this volume.
3:45: We need something new.
3:46: The old statistical methods were breaking down.
3:48: Pretty much.
3:49: It was like trying to drink from a fire hose with a straw.
3:52: The scale was just off the charts.
3:54: And around then, these Knowledge Discovery in Databases, or KDD, workshops started in '89.
4:00: Were they important?
4:02: Hugely important.
4:03: They became a hub, really, a place for the academics and the practitioners to actually meet, share what they were learning, and figure out how to get value from these massive databases.
4:13: It grew into a major conference series.
4:15: OK, so the community is growing, the problems are getting bigger, then the early 2000s seemed like a real turning point.
4:23: William S. Cleveland's Data Science: An Action Plan, in 2001.
4:28: Yeah, Cleveland's plan was significant because it was so direct.
4:32: He basically called for statistics as a field to broaden its technical scope to really tackle data analysis properly.
4:39: He laid out like six specific technical areas universities should focus on, arguing for dedicated research funding.
4:46: It was a real push to make data science a distinct academic field, a blueprint, almost.
4:50: You could say that, yeah.
4:51: And then the Data Science Journal launches in 2002, another sign of maturity. For sure.
4:55: Having a dedicated journal, you know, it gives the field legitimacy, a place to publish research, discuss methods, applications, even the legal stuff around data.
5:03: It was a big step.
5:04: OK, so the field is getting formalized, but we still have this massive data problem.
5:09: Then comes Hadoop in 2006.
5:12: Why was that such a game changer?
5:14: Hadoop.
5:16: Well, it hit a core problem head on.
5:19: Storing and processing huge amounts of data that wasn't neatly organized in traditional databases, what we call non-relational data, like social media posts or sensor data.
5:29: Exactly, stuff that doesn't fit nicely into rows and columns. Traditional systems just choked on it.
5:35: Hadoop offered this open source way to spread the storage and the processing work across many computers. Distributed computing, right?
5:41: Suddenly analyzing billions of tweets or, you know, massive log files became possible.
5:47: It unlocked the potential that was sitting there in all that data.
5:50: It provided the engine room. Very good.
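To make the "spread the work across many computers" idea concrete, here is a minimal sketch of the map, shuffle, and reduce pattern associated with Hadoop. It is plain Python that uses threads to stand in for separate machines; it is not Hadoop's actual API, and the sample posts and function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of lines, as a single worker would."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group emitted values by key across every worker's output."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values collected for each key into one count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, workers=3):
    # Split the input into one chunk per (hypothetical) worker node.
    chunks = [lines[i::workers] for i in range(workers)]
    # Threads stand in for machines here; a real cluster would ship each
    # chunk to a different computer.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


# Hypothetical sample data standing in for posts or log lines.
posts = [
    "big data is big",
    "data lakes hold raw data",
    "logs are data too",
]
counts = word_count(posts)
print(counts)
```

Each worker only ever sees its own chunk, so adding more workers lets the same logic cover more data. That is the property that made analyzing very large log files practical.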
5:52: Yes, and then the job title itself, data scientist. Around 2008 it really takes off.
5:57: It does.
5:58: DJ Patil and Jeff Hammerbacher are often credited with popularizing it.
6:01: It just clicked.
6:02: It described this person who could handle the data, knew the tech, but also understood the business side. The bridge builder. Kind of, yeah.
6:09: And then of course, Harvard Business Review in 2012 called it the sexiest job of the 21st century, right?
6:15: I remember that.
6:16: That definitely gave it a huge visibility boost.
6:18: Talk about good PR.
6:20: From niche term to sexiest job, quite a leap.
6:23: But the tech kept evolving too.
6:25: NoSQL databases came back into focus around 2009.
6:28: And then data lakes in 2011.
6:30: Yeah, these were about flexibility.
6:33: Traditional data warehouses are very structured, very rigid.
6:36: You have to define everything up front, which takes time and effort.
6:40: Exactly.
6:41: NoSQL databases were better for handling messy, unstructured data, and the data lake idea promoted by James Dixon was basically: just dump all your raw data, in whatever format, into one big storage space. Like a real lake, just pour it all in.
6:55: Sort of.
6:55: The idea is you figure out how to structure and analyze it later when you have a specific question.
7:00: It allows for more agility, faster exploration compared to designing a whole warehouse first. More like a giant storage bin
7:07: than a neat filing cabinet.
7:09: That's a good way to think about it.
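The "structure it later, when you have a question" idea is often called schema-on-read. Below is a minimal sketch, assuming the lake is just a list of raw JSON lines; the records, field names, and the `read_with_schema` helper are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical raw records, stored exactly as they arrived.
raw_lake = [
    '{"user": "ana", "event": "click", "ms": 120}',
    '{"user": "bo", "event": "purchase", "amount": "19.99"}',
    '{"sensor": "t-7", "temp_c": 21.5}',
    'not valid json at all',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep records that have every requested
    field, and cast each field to the requested type."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed lines stay in the lake; this query skips them
        if all(field in record for field in schema):
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


# Two different questions, two different schemas, one untouched raw store.
purchases = read_with_schema(raw_lake, {"user": str, "amount": float})
readings = read_with_schema(raw_lake, {"sensor": str, "temp_c": float})
print(purchases)
print(readings)
```

Nothing had to be designed up front: the same raw lines serve both queries, and the cost of structuring is paid only for the fields a question actually needs. A warehouse makes the opposite trade, fixing the structure before any data is loaded.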
7:10: And just to underscore the sheer amount of data, there's that IBM stat from 2013.
7:14: Oh yeah, mind-blowing.
7:16: They estimated 90% of all the data in the world had been created in just the previous two years.
7:21: 90% in two years.
7:23: It's hard to even grasp that exponential growth.
7:26: It really is, and it hasn't stopped.
7:28: So that kind of brings us up to today.
7:30: Data science isn't emerging anymore.
7:32: It's, well, it's everywhere, absolutely everywhere.
7:35: It's truly interdisciplinary now.
7:37: You find it in business, government, health care, science, engineering.
7:41: You name it.
7:42: And how would you define it now?
7:43: Like IBM's definition?
7:44: IBM describes it as using algorithms, methods, and systems to get knowledge and insights from all kinds of data, structured and unstructured, using analytics and machine learning.
7:55: Basically helping people make better predictions, optimize things, make smarter decisions.
7:59: OK.
8:00: And what does a typical project involve?
8:02: Is there a standard process?
8:03: There's a general life cycle, yeah. It usually starts with gathering the data you need, then the really fun part, cleaning it, getting it ready. The not so glamorous part.
8:12: Huh.
8:12: Often, yes, but crucial.
8:15: Then you explore it, look for patterns, then you build models, train them, and finally, you interpret what the models tell you and, importantly, communicate those findings. From raw stuff to actual insights.
8:26: That's the goal, a journey from data to knowledge.
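The life cycle described here (gather, clean, explore, model, interpret and communicate) can be sketched end to end in a few lines. This is a toy illustration using only the standard library; the ad-spend and sales numbers are invented, and a simple least-squares line stands in for whatever model a real project would use.

```python
from statistics import mean

# 1. Gather: hypothetical raw records (ad spend, sales) with messy entries.
raw = [("10", "25"), ("20", "45"), ("", "50"), ("30", "65"), ("40", "n/a"), ("50", "105")]


# 2. Clean: drop rows whose fields cannot be parsed as numbers.
def clean(records):
    rows = []
    for x, y in records:
        try:
            rows.append((float(x), float(y)))
        except ValueError:
            continue
    return rows


# 3. Explore: basic summary statistics to get a feel for the data.
def summarize(rows):
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    return {"n": len(rows), "mean_x": mean(xs), "mean_y": mean(ys)}


# 4. Model: fit y = slope * x + intercept by ordinary least squares.
def fit_line(rows):
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in rows) / sum((x - x_bar) ** 2 for x in xs)
    return slope, y_bar - slope * x_bar


rows = clean(raw)
summary = summarize(rows)
slope, intercept = fit_line(rows)

# 5. Interpret and communicate: turn the fitted numbers into a plain statement.
print(f"Kept {summary['n']} of {len(raw)} records after cleaning.")
print(f"Each extra unit of spend is associated with about {slope:.1f} more sales.")
print(f"Predicted sales at spend 60: {slope * 60 + intercept:.0f}")
```

Even in this toy version, a third of the input is discarded at the cleaning step, which is why that stage tends to dominate real projects.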
8:29: And we're even seeing automation creep in, things like AutoAI.
8:33: Right.
8:34: Tools are being developed to automate some of the more repetitive steps, data prep, basic model building.
8:40: The idea is to free up data scientists for the harder, more strategic thinking.
8:45: Makes sense.
8:46: But there was also that point about a potential shift towards conservative programming.
8:50: What's that about?
8:51: Yeah, it's an interesting observation.
8:53: The idea is maybe to avoid big mistakes or overly complex systems that break.
8:59: There's a trend towards making smaller, more incremental changes in data science software. Like small tested updates instead of huge overhauls.
9:07: Kind of more cautious.
9:09: But as Scott Huffman from Google noted, the potential downside is you might stifle the really big breakthrough ideas.
9:15: You guard against mistakes, but maybe also against major innovation.
9:18: That classic trade-off, stability versus bold leaps.
9:23: Exactly, it's a balancing act.
9:24: OK, let's talk about the people, the data scientist.
9:28: What's their main job?
9:29: Day to day.
9:30: Fundamentally, they work with data to find useful conclusions that help people make decisions.
9:36: So that means finding data, cleaning it, transforming it, visualizing it, analyzing it, building models. The whole life cycle we talked about? Pretty much.
9:44: Then explaining what it all means and providing recommendations. They need to understand the problem they're trying to solve, not just the data.
9:52: Got it.
9:52: And it's not just them, right?
9:53: There's a whole team involved usually.
9:54: Oh definitely.
9:55: It's an ecosystem.
9:56: You've got data analysts often focused more on visualization and initial exploration.
10:02: Data engineers are critical.
10:03: They build and maintain the pipelines, the infrastructure. The plumbers of the data world? In a very sophisticated way, yes.
10:11: Then database administrators keeping things running smoothly and securely, machine learning engineers who specialize in building and deploying the algorithms, data architects designing the overall blueprints, statisticians bringing deep theoretical knowledge, business analysts connecting insights to strategy.
10:28: And managers overseeing it all. It takes a village.
10:32: Wow, quite a range of roles.
10:33: It really is a team sport. And the skills needed?
10:36: It's a mix, right?
10:37: Tech skills and other stuff.
10:39: Absolutely.
10:39: You need the technical chops: stats, math, programming like Python or R, SQL
10:44: for databases, knowing the big data tools. That's essential, but not enough on its own.
10:49: No way.
10:50: You also need like real curiosity about data, asking good questions.
10:54: You need business sense, understanding the context.
10:56: Communication is huge, explaining complex stuff clearly.
10:59: And teamwork, obviously. A real blend.
11:02: OK, so the field itself is still changing fast, isn't it?
11:05: Lines blurring between roles, more specialization.
11:08: Yeah, definitely seeing that over the last, say, 5, 10 years, more specialization.
11:11: The difference between a data analyst and an ML engineer is clearer now, maybe.
11:15: Cloud computing, tools like Spark, they've driven changes too, demanding new skills. And some things haven't changed, like the headache of data cleaning.
11:24: Still a massive challenge.
11:26: Ask any data scientist.
11:27: It consistently takes up a huge chunk of time.
11:30: Garbage in, garbage out.
11:32: It doesn't matter how fancy your model is if the data is bad. The foundation has to be solid.
11:36: Absolutely.
11:37: And something else that's becoming more prominent is ethics, right?
11:42: Responsible AI.
11:43: Hugely important.
11:45: As these tools get more powerful and more embedded in everything, we have to think about the ethics.
11:50: Data privacy, algorithmic bias, fairness, societal impact.
11:54: It's central now.
11:55: Can't ignore it, shouldn't ignore it.
11:57: And we're also seeing more work moving from local machines or servers onto the cloud.
12:02: That's a big shift in how development and deployment happen.
12:05: OK, so looking ahead, what's the future hold?
12:07: Seems like a good field to be in.
12:09: Oh, the job outlook is still incredibly strong.
12:11: High demand is projected to continue, and the underlying tech, AI, machine learning, deep learning, it just keeps advancing rapidly.
12:19: That's going to keep driving the field forward. And the wider impact, more automation?
12:23: We'll likely see more automation in many jobs.
12:25: Yes, predictive modeling will get even better, influencing everything from healthcare diagnoses to, you know, supply chains.
12:33: The stats will remain key, crucial, especially in research and deployment.
12:37: And things like using big data in education for personalized learning, that has huge potential too.
12:43: So it sounds like data science isn't slowing down anytime soon.
12:45: It's going to keep shaping things for sure.
12:47: It started with statistics, but it's grown into this massive, vital field driving innovation pretty much everywhere.
12:54: OK, so here's something to leave you, our listener, thinking about.
12:58: As data just keeps growing and growing, seemingly forever, how do you think the job of the data scientist and the tools they use will have to change to keep up?
13:07: How will they continue to shape our world?
13:09: Definitely food for thought.
13:10: Tech daily.AI, your source for technical information.
13:14: This deep dive was sponsored by StoneFly, your trusted solution provider and advisor in enterprise storage, backup, disaster recovery, hyperconverged and VMware, Hyper-V, Proxmox cluster, AI servers, and public and private cloud.
13:27: Check out stonefly.com or email your project requirements to sales at stonefly.com.
