pszemraj

whisper large - big data fall 2022

Oct 9th, 2022
https://youtu.be/b7h7O-_o81k

transcription with whisper - large

All right, welcome everybody in the lecture hall in the CBB building. Welcome everybody who joined us on Zoom. So this is the first lecture of the semester for Big Data. It's so nice to see you all here with a full lecture hall. We have been missing that so much in the previous two years. It feels good to be together. I will continue to provide the hybrid option for those of you who want to join on Zoom. This will be throughout the semester. We will record the lectures. This is also why the green screen is here, because when you watch on YouTube, you have everything on your screen. So yes, we retain a few lessons from what happened in the past two years, and there are a few things that we can keep for the future.
So what are we going to do in this lecture? We are going to learn techniques and skills in order to query amounts of data that can be gigantic: not just gigabytes or terabytes, but even petabytes or even exabytes of data. So we are going to learn how this is actually done in practice, what the principles are, and what technologies were invented in the past few years. And on top of that, another problem came together with the amount of data that is stored: the data is messy. It's a huge mess. It's not like the super nice structured tables we had in the 70s. So we'll speak about that a lot: how to deal with large amounts of messy data. And there is one project on which a lot was done in the past few years. Now we have something working in Jupyter notebooks, on Azure, probably in December or so, where you can deal with very large amounts of data yourself. So we will see plenty of super cool things, such as Hadoop and MapReduce. Who has already heard of MapReduce, or Hadoop? So we'll learn about that. That's very nice, I see enthusiasm. How about Spark? Who knows Spark? Who doesn't know Spark, never heard of it? Okay.
So these are some of the cool technologies that have emerged, and we'll learn the secrets of how they are actually used. But in order to start the lecture, I always do something different in the first week, instead of diving directly into the actual material: why we are doing all of that, especially in terms of scale, just to give you a feeling.
When you think of scale, you might also think of the universe and the exploration of the universe. Right. So this is not exactly the Big Bang. This is what's happening at CERN, but it's not so far away from actually reproducing what happened in the Big Bang. But look at our scale. So, one of us: 1 meter 85, let's say, so one to two meters. This is the order of magnitude. This is the scale we see every day. Right. So we zoom out to the Earth. The circumference of the Earth is 40,000 kilometers. That's a lot. By the way, you might be wondering why that is a round number. It's actually not by chance that it's a round number: this is how the meter was defined back then, before it was defined directly through the speed of light and the second. So that's why it's a round number. And we can also convert that, because we have units for it: we can also say that it's 40 megameters. Then we continue to zoom out into the solar system. So around us: 150 gigameters. So you see another prefix. All the way to Jupiter, we are already at a terameter. The entire solar system: one petameter. And here, this is our galaxy. By the way, it's incredible the time that it took us as humankind, it took us millions of years to actually figure out that the little things we see up there are actually just what we are in: the Milky Way. So this is here a thousand exameters, which you can also call a zettameter. And then this is zooming out as much as we can do it today: we are at 138.
So this is the background noise as we can see it. It might be that there's even more behind that, but we just don't know; we cannot see it. Right. So this is it, and here I didn't correct for the actual expansion of the universe, because it would actually be even bigger if you consider that it expanded. But what I wanted to do with this little exercise is to show you that we have prefixes, that there are very large scales, and that it took us millions of years to reach these scales. And when we look at these scales and go back all the way to the Big Bang, we actually end up studying particles in the small. This is what the physicists are actually doing. And in big data, I would argue, it's not so much different. We will spend a lot of the lecture looking at very, very large scales, at huge amounts of data. But we will also spend parts of the lecture on data modeling at a very, very small scale. Why? Because data is messy. So we also need to learn how to deal with small amounts of data that can be heterogeneous. And you can compare data science with physics in many aspects: physics is the study of the real world, and we run experiments in order to see how the real world is behaving. And data science is the same with data: we manipulate data and see what happens.
I have a question for you. So we are going to be using, throughout the lecture, something called the clicker app, the EduApp. It's right here. This is the address. There are many ways of doing that: you can go in your browser to this address right here, eduapp-app1.ethz.ch. You can do it in the browser, because I know that not everybody has a cool new smartphone. You can also download and install the app called EduApp, for free, from whatever download center there is for your operating system, right. And connect right there. And here I need to do something with sharing my screen; hopefully it will still work.
So I think I need to go right there. And I think that this is what I want to ask you. I would like to just collect a bit of data on your background. So you can say whether you are computer scientists. We might have data scientists as well. Maybe a few computational biologists, a few computer science kids. So that's an opportunity to test the app. I'm giving you enough time so that you can make sure that you can use it. 61 on the Internet is in 68. So you can do it from the lecture hall or you can do it from Zoom; it works just as well. So yes, let me put up the slide again. Can you see this? So let me just repeat it: eduapp-app1.ethz.ch. It seems that the number is slowly converging. Is anybody having difficulties, or hasn't managed to connect yet? This is always what happens when you do things live. This is as expected. So you see it's working again, and now I'm landing exactly where I was. Okay. And that works. Fabio, you can hear me again? Okay, awesome. So the sound is working. Perfect. So at least now you see that we are live and that this is not actually pre-recorded. Well, you obviously know, right, but the people on Zoom might not know. All right.
So indeed, most of you are computer scientists and data scientists, because this is a master's level lecture for computer scientists and data scientists. And it also exists in other forms, because we forked it at some point, because it was becoming very large. So now we have Big Data for Engineers for other departments, which is offered in spring. So that would be next year. And we have Information Systems for Engineers, which is the equivalent of the bachelor's level database management lecture, where you learn relational databases and SQL.
But again, for those of you who could not register in the system because you are not in the computer science or data science program, and that should be very few of you, hopefully: just so you know, there are these two lectures for the other departments, and you're absolutely welcome to attend them. So right now, I just have to lose the habit of showing directly. So where we are is right here. Okay.
So now I would like to say a few words about how we do science. For a very long time we've been doing mathematics and physics. What's the difference between mathematics and physics? Well, you can easily land in a very big philosophical debate, and I know not everybody agrees with me. I tend to say, in terms of modal logic, that mathematics is necessary, in the sense that we can do it from an armchair: just sit and think. Physics, in contrast, is about the world as it is, because the world could have been different: maybe the masses of the particles could have been different, maybe the laws of physics could have been different; there are many other things that could have been different. So we have a whole contingency, and for this we have no other way than through experiments: we have to just play around with the world and see what happens, and then figure out everything that's contingent. But recently we got computer science, which added more to the mix, and in particular this is what we can do with a machine, and we also came up with an entire field of science with theoretical computer science and so on. So this is computer science. And data science I see as being this last missing piece that is nicely at a sweet spot between physics and computer science. Why? Because like computer science, this is all automatable: you can do it from Jupyter notebooks or with fancy clusters.
But it's just like physics, because what do you do in data science? You study the data and understand the world as it is. We could not do it without data: we have to collect data and work on the data. It is in that sense that it is epistemic and contingent, just like physics: we collect data and then learn about the world as it actually is. So this is amazing, but basically the reason we are doing that is that we want to learn the actual facts about our own world and how it works. People were actually already wise a long time ago, thousands of years ago: a good decision is based on knowledge, not numbers. So you need to actually manipulate all of these numbers and all of that data in order to make sense of it.
So I would like to go ahead now and continue with the history, a short history of databases: how we store and manage data. And start with the prehistory of databases, and actually it's extremely long ago, probably even more than you might have expected in a database lecture: thousands of years ago. Because thousands of years ago we were already storing data and managing data. How? In our brains. People would just observe, keep track, and remember what's going on, and then they would spread it to everybody else, so to their children and grandchildren, but you would also have people singing from city to city in order to explain what people are doing, especially the kings and empires and so on, to make sure that the knowledge is spread. But there is a problem with that: information can get distorted and lost, because our brains are not fully reliable, and we also need to make sure that it is retained over centuries and even millennia, right. So this is why we needed to actually solve a problem, and the problem was solved thousands of years ago with the invention of writing. Writing was the first time that we actually figured out that we can store data, information.
And we have a tablet: that's how it was done at first, in a way that is preserved through thousands of years, and in fact we still have some of it today, and this is why we can actually do history. This is what most historians would consider the start of history. Before that, we kind of lost everything, we don't have anything left, but this gives us a way of actually learning about what our ancestors had been doing. And I'm going to show you something even more super cool and awesome: this thing here. So this is a tablet called Plimpton 322. I don't know what you think, but this looks a lot like some sort of tablet, not a clay tablet, a smart tablet with spreadsheet software installed on it. And the data is structured as a relational table. This is a relational table: relational tables are thousands of years old. Does anybody know what is stored on this tablet? So this is information on the Pythagorean theorem: when you have a triangle with a right angle, then three, four, five is an example of possible integer lengths that you could have. Well, this is just a list of some of those that were known at the time. They needed that, well, mostly for keeping track of the actual land, who owns what: when there is a flood, we need to be able to reproduce the actual shapes of the lands, right? And this is also used in order to get right angles in construction. And so this was stored in there so that people would have a database, this is a database, somewhere where these numbers are actually stored. There are a few people who can decrypt all of that and understand what there is in there. I think it's in base 60: if you look closer, everything is in base 60 in there.
Is everybody on Zoom saying the same? Oh really? Because I have... Yes, it seemed that the Wi-Fi was working fine. But if there is an issue and it repeats, we'll definitely switch to LAN.
Is it back to normal now, or do you have difficulty hearing me on Zoom? Otherwise, I'll just continue. And let me know if the difficulty is persisting. And the difficulty is that it cuts off? Okay. Oh, once in a while. Okay. So let's see if the frequency increases. I'll try to continue, and then we'll see if there is any issue; I can maybe also try to reconnect the Wi-Fi or something. All right. And maybe during the break, we can try to get a LAN connection working. All right. Thank you very much for letting us know. Okay.
So this is writing. Then came printing. What is the problem that printing solves? It solves the problem of making duplicates and copies of the same thing. Because if you want to spread books, for example, you need to replicate them. Back then, you needed people to do it manually. They would sit for hours and just make manual copies of everything. Typically monks were doing that. With a printing press, with these pre-printed characters that you just press onto the paper, it's just replication. So this was the 15th century.
More recently, computers were invented. By the way, there is one in this building, I think on the E floor below, that's one of the early ones we had in the department, a Cray supercomputer. It's worth the visit if you haven't seen it already. But computers automated things and made things faster. Because then we could start processing information. We could start manipulating information. We could start querying information in order to be even faster. So this was already kind of the beginning of the true revolution. And in the 60s, we had file systems, with directories and files. Back then, this is how people would deal with data: they would basically manipulate files on the disk and deal directly with the files, read files and output files. So this is how it was done. And then something happened in 1970. Somebody called Edgar Codd, remember that name, 1970.
This was the beginning of database history, where the brilliant idea that Edgar Codd had is that people should think of the data on an abstract level, in terms of data shapes and models. So tables are quite intuitive. I showed you a clay tablet that is thousands of years old. Everybody understands tables. So that was a natural thing to do, to say: now we should isolate the user from everything that's too complicated on the physical level. They shouldn't have to deal with files and directories and syntax and so on. And we'll just expose everything as tables with rows and columns. And that's it. And the principle behind that is called data independence. We shield the user from what is below. And I'll come back to that in the second unit, probably starting today and then tomorrow, on what has been done in the 70s. And then in the 2000s, and this is where we arrive at big data, this wasn't enough. We'll explain a lot why this wasn't enough. And more technologies came: key-value stores, triple stores with graph databases, column stores, which are very, very large and sparse relational tables, and document stores, which basically store collections of trees. So this is what was known as NoSQL, not because we don't like SQL, but "not only SQL", meaning that there is something beyond SQL. All right, so that was a quick history of databases as it has evolved throughout the centuries.
And now I'll say a few more words on big data. So it's a buzzword, right? A buzzword is quite hard to define. So I'm going to try. I'm going to tell you what I think big data is. So first, what we probably can say is that it involves a lot of proprietary technologies, because this is where it was invented. It's not that I'm advertising for any of these, right? I'm not advertising.
I'm just saying as a fact that a lot of these technologies actually came from companies rather than universities, probably because there are a lot more financial means there in order to have very large clusters that you can use to store data, right? So this is how it happened. And we are going to look, of course, into what was done. But for many of them, we'll use an open source equivalent: Hadoop is an example of an open source equivalent. Spark is actually open source; Databricks would be the company behind it. And so on. All right, so we'll see some company names. There are also search engines on the internet now for specifically looking up data; that also exists now. But the best approach to big data is to look at it from the perspective of the three Vs. The three Vs are volume, variety, and velocity: VVV. So we are going to discuss them, volume, variety, and velocity, in the next few minutes and go more into details. So first volume, then we'll do variety, then velocity.
Why volume? Well, an answer that you hear from a lot of companies is: because we can. So a lot of companies, maybe 10 or 15 years ago, started realizing that with so much space available for so cheap, why would we delete any data? Let's just keep it, just in case. That was the mindset back then. So let's keep all the data. We might need it in the future, who knows. Of course, as you know, things have changed. Now we think a lot about data protection, about who has the right to store data. There is the GDPR in Europe and so on. Right. So just because we can doesn't mean it's a good idea. Right. But technologically, we can store and keep all of that data. And that works across all levels, right: the infrastructure, the data centers, as we will see, the hardware, the software that runs on it, and the technology that was invented; I include here maybe machine learning, artificial intelligence, and so on.
Another reason is that data has value. Some actually say that data is the new oil. It's just a new resource that we can now manipulate, store, and use, and we can extract value from it. Right. So this is also why data is stored. Of course, there are a lot of considerations that are out of scope of this course, right, because here I'm teaching you the technologies. It doesn't mean that you cannot think critically, of course. Data carries value. It has an impact, for example, when you have free products offered over the internet that are collecting your data, right. And this is a reality to also think of. I recommend, for example, the course Big Data Law and Policy, which is a D-GESS Science in Perspective course that talks about this kind of topics, from a legal perspective and from an ethical perspective: you know, just because you can do it doesn't mean you should do it. But as I said, in this course we will focus on the actual technologies that allow us to do these sorts of things.
Another aspect is that the utility of a joined dataset is higher than just the sum of the two. It's the power of the join. Joins are extremely useful. I'll come back to this, for those of you who might not know joins, in section two. Joins are super useful when you want to cross multiple datasets in order to link records with other records. And this has a lot of value. It's also very expensive. I hope that in this lecture, you will understand that a join is something we try to avoid in some way when we deal with large quantities of data, in particular in terms of complexity, in terms of big O. O of N is what we love in big data. Everything that's O of N, that's the sort of thing we can spread over clusters and so on. As soon as you're above that, maybe N log N, kind of. But as soon as it starts being quadratic or exponential, oh my, right. So linear is what we love.
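The point about joins and linear complexity can be made concrete with a small sketch. This is only an illustration under stated assumptions, not how any system covered in the course implements joins: the nested-loop strategy, the function names, and the toy `users` and `orders` datasets are all made up here to contrast a linear pass with a naive join.

```python
def count_scan_steps(records):
    """One pass over the data: the number of steps grows linearly, O(N)."""
    steps = 0
    for _ in records:
        steps += 1
    return steps


def nested_loop_join(left, right, key):
    """Naive join: compare every left record with every right record, O(N * M)."""
    comparisons = 0
    joined = []
    for left_record in left:
        for right_record in right:
            comparisons += 1
            if left_record[key] == right_record[key]:
                # Merge the two matching records into one joined record.
                joined.append({**left_record, **right_record})
    return joined, comparisons


# Hypothetical toy datasets, invented for this illustration only.
users = [{"user_id": i, "name": f"user{i}"} for i in range(200)]
orders = [{"user_id": i % 200, "order_id": i} for i in range(400)]

scan_steps = count_scan_steps(users)
joined, comparisons = nested_loop_join(users, orders, "user_id")

print(scan_steps)    # 200 steps for the linear pass
print(comparisons)   # 200 * 400 = 80000 comparisons for the naive join
print(len(joined))   # 400 joined records (two orders per user)
```

Doubling both inputs doubles the scan but quadruples the number of comparisons in the naive join, which is the kind of growth the lecture warns about at scale.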
And hopefully you will develop a feeling for what algorithmic complexity means in big data. Another aspect of data is the collection of complete datasets. It's particularly useful for websites that claim to have a complete dataset and a complete search, for example to find hotel rooms and so on. So completeness is also another aspect of data. And of course, this is intertwined with the fact that we can actually store all of that data. If you think of data like a social network with billions of users, it actually fits on a laptop. The list of everybody on Earth fits on a laptop; you can compute it in terms of gigabytes, maybe terabytes, but it fits. So this is actually a small database if you just have it as a list.
In terms of scales, I've already explained the prefixes to you. This is my first assignment to you: if you haven't done so already, I'm asking you to learn this by heart. All these prefixes here: kilo, mega, giga, tera, peta, exa, zetta, yotta. It's just the powers of 10, but adding three zeros every time. These are standardized international units. Everybody agrees on them. There was a standardization body that dealt with that. As you can see, it's been done in batches, not all at the same time, because you see that kilo was very early, then mega, giga. And then tera, peta, exa were probably invented at the same time. Why? Because tetra, penta, hexa: this is the Greek way of saying four, five, six. If you count here, there are four, tetra, one, two, three, four groups of three zeros; one, two, three, four, five, penta, giving peta; and six groups of three zeros here, hexa. That's how I remember. Tetra, penta, hexa give you tera, peta, exa. Zetta and yotta probably came later. Maybe the new people who gathered said: okay, let's start from the end of the alphabet; we didn't use Z and Y yet, right? So this is how these two were added. And then you might think: is that enough?
Well, it's remarkable that this is enough to define the length of the universe. This is enough. But I think there are already physics papers out there where they basically ran out of zeros. So you can guess how they continued: they basically continued with W, X, and so on. They didn't give any names to those, but they just used W and X as the way to express even more zeros. But this hasn't been standardized yet. All right. So you must know this by heart. Super important, because in big data you will need these units all the time, but typically with bytes, right?
So I have another question for you. I'll switch again to the other screen. It is to ask you what you consider to be big in terms of the amount of data: one gigabyte, one terabyte, one petabyte, one exabyte. By the way, people on Zoom, was the connection good in the past minutes? Could you hear well? In the worst case, don't worry, because I have many, many recordings of previous years. So in the worst case, if a recording doesn't go through or there is a problem, I can just reuse some recordings of previous years. So don't worry about that. We'll have solutions. All right. So this is what's cool with the first week: we get these large numbers over there. Hopefully you continue to come in the next few weeks. This is actually what I hope. All right. So many of you say petabytes, and I would actually agree with you. I think that petabytes is a good number for big data. Why? I would say because gigabytes and terabytes fit on a single computer. Petabytes is when that's no longer enough: you need to go to a cluster. And exabytes, yes, but this is maybe huge rather than big, because exabytes, I would say, is almost the scale of humanity. At least it was recently, but now I suspect that some companies are actually reaching exabytes. So the scale keeps shifting in terms of what we perceive as big. Right. Okay.
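As a memory aid for the prefixes above, here is a minimal sketch that maps each prefix to its power of ten and counts the "groups of three zeros" used in the tetra, penta, hexa mnemonic. The names `SI_PREFIXES`, `to_bytes`, `zero_groups`, and `humanize` are illustrative choices made for this example, not part of any course material.

```python
# Decimal prefixes named in the lecture, each adding three zeros.
SI_PREFIXES = {
    "kilo": 3,
    "mega": 6,
    "giga": 9,
    "tera": 12,
    "peta": 15,
    "exa": 18,
    "zetta": 21,
    "yotta": 24,
}


def to_bytes(value, prefix):
    """Convert e.g. (1, "peta") into a plain number of bytes."""
    return value * 10 ** SI_PREFIXES[prefix]


def zero_groups(prefix):
    """Number of groups of three zeros: 4 for tera, 5 for peta, 6 for exa."""
    return SI_PREFIXES[prefix] // 3


def humanize(n_bytes):
    """Express a byte count with the largest prefix that keeps the value >= 1."""
    name, exponent = "", 0
    for prefix, power in SI_PREFIXES.items():
        if n_bytes >= 10 ** power:
            name, exponent = prefix, power
    return f"{n_bytes / 10 ** exponent:g} {name}bytes"


print(to_bytes(1, "peta"))       # 1000000000000000
print(zero_groups("tera"))       # 4, as in "tetra"
print(humanize(5 * 10 ** 6))     # 5 megabytes
print(humanize(26 * 10 ** 12))   # 26 terabytes
```

With this, the answer given in the lecture reads as: up to `to_bytes(1, "tera")` fits on one machine, while around `to_bytes(1, "peta")` a cluster is needed.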
So let me come back to the slides. If I have prepared everything correctly, we should land every time exactly where we left off. Okay. So you should actually be impressed here by the progress that we've made, because going from a single computer to clusters and so on, and maybe zetta- or yottabytes at the scale of humankind, is actually like going from our scale to the entire visible universe. Right. And that was done in just a few decades. This is what's incredibly amazing: in just a couple of years, one, two, three years nowadays, we generate as much data as all of humankind since the very beginning. In just three years, as much as everything since the very beginning. This is extremely impressive, the exponential growth that we have here in generating all of that data.
So there is another system of units, just checking the time, for those who love powers of two. It was a bit of a mess at some point: do we use powers of two or not? I'm not asking you to learn that by heart, but just so you know, that also exists. Right. If you want to actually measure with two to the power of 10, 20, and so on, there's a higher deviation maybe for the larger ones, but that exists. Usually we talk in kilobytes, megabytes, and so on. It's simple. Okay.
Now the shape of the data. Data has shapes. Tables you already know. We've known them for actually thousands of years. But data can also be shaped like a tree. We will see tree-shaped data, because this is what messy data looks like: denormalized, as we will see. This is XML, JSON, YAML, and many other ways of doing that, even data frames. Have you heard of data frames, pandas? Yes. So this also fits there. Data frames can also be seen as collections of trees. We will see that. These are collections of valid trees, valid JSON, but we'll come back to that. We have graph databases, which we will also spend one week on at the end of the semester.
We have data cubes. We will also talk about data cubes at the end. And we have text. Text is in a different lecture, that's in Information Retrieval. Do we have Information Retrieval students here, who took Information Retrieval last semester? So we actually dealt with that six months ago in Information Retrieval. This is how we dealt with text. But basically there are five fundamental shapes: tables, trees, cubes, graphs, and text. These are the fundamental shapes of data. It is extremely important to understand the shape of your dataset, because this is what is going to boost your productivity and the performance of your system. If you don't pick the right shape, this is going to be super slow, both in productivity and performance.
Next, velocity, the third V. So we just keep generating data all the time. Just to give you an impression of why we actually did big data, I'm going to do a quick thought experiment on what has happened over the past few decades. Let's talk about capacity, throughput, and latency. The capacity is the quantity of data you can put on it, the size of your hard drive: for example, a terabyte, eight terabytes. Then the throughput is the speed at which you can read it: how many bytes per second can you actually read from your storage medium, a hard drive nowadays? And finally, the latency... I think I can actually make this go away. The comments of Zoom should go away. All right, so I see that for some people on Zoom, it's still cutting from time to time, right? So we'll see if we manage LAN during the break. Otherwise, I'll share last year's recording, to make sure that you have the same things. Well, it's not exactly the same things, since I keep updating; I keep making the lecture up to date.
All right, in 1956, this is what we had. That was the first commercially available hard drive. How big was that? Look at the dimensions. It's like, you know, this big, right? Enormous. And we fit in there an amazing five megabytes of data. Can you imagine? This is enormous. We can read it at 12.5 kilobytes, characters, per second.
And the latency was 600 milliseconds. But of course, you need time in order to move the reading head in there, right? Okay, this is what one level looked like. So this is what rotates in there. And then you just read on that. So that was the IBM RAMAC 350 in 1956. This is what we have today. Well, you can barely see it, because it's so small that we need to actually zoom in. This is the largest that I could find: the Ultrastar DC HC670. It's right there. 26 terabytes of data. The throughput: 250 megabytes per second. The latency: a few milliseconds. And the dimensions are like this, right, it fits in the hand.
So now I have another question for you. I'm sure you have actually experienced that when you copy files, the progress bar, you know how it goes with the progress bar: it goes all the way to 99% and then it stays there forever. Right. But basically, it's not linear. Why do you think that is? Is it because files have different sizes? Because the progress bar doesn't refresh regularly and the transfer is actually happening at a uniform rate? Is it not true, and you're lucky enough that your laptop always shows uniform progress bars everywhere? Or is it because CPU usage may vary? Let's have a peek at your answers. We have two competing answers, right? I think it's not going to close. Yeah, there you go. Oh, it actually reset everything. Well, anyway, you saw it, right? So it's the fact that the files have different sizes. Here's why. When you have large files, the bottleneck is the throughput: how fast can you actually read the bits from the drive or copy them over? And when you have many small files, it's no longer a throughput issue. It's a latency issue: at every file, the head of the hard drive has to move to a different place.
So this is the reason why, if you measure the throughput rate, it's going to be regular for large files, and for small files you will have the feeling that it's super slow, just because you keep jumping all the time all over the disk to fetch the next file. So this is the main reason why it behaves that way. And this idea of throughput versus latency is going to keep us busy for weeks. We are going to talk about it for cloud storage, for HDFS, the Hadoop Distributed File System, and so on. It's going to be central in the way we think: the throughput versus the latency, right? Okay, let me move over to the slides.
Okay, so this is the progress made in capacity. This is the progress made in throughput. Yes, yes, it's right there. It's small, but it's right there. And this is the progress made in latency. Of course, latency you want to decrease, not increase. We are in big trouble. Big, big, big trouble. I can show it on a logarithmic scale, but it doesn't make it any better. Imagine with a book that the capacity is the number of words, the throughput is the speed, the number of words we can read per minute, and the latency is how much time you need to get to the shelf and actually pick the book, right? Or the robots at ETH that actually get the books. You should visit the library, by the way; it's very interesting to see the little robots walking around, well, not walking but sliding around. Anyway, if you divide this by this, you can compute that it takes you 10 hours to read a book today, of that size and with that speed. But now, if you go to the future and imagine that books in two centuries have evolved exactly like hard drives in the previous 70 years, right? Then this is the new size of the book in two centuries, and this is the new speed. This is the problem. Now it takes you 11,400 to fill. It has massively increased. This is why we cannot just keep going with the same technologies.
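The "divide this by this" reasoning can be replayed with the two hard drives quoted earlier in the lecture (5 megabytes at 12.5 kilobytes per second in 1956; 26 terabytes at 250 megabytes per second today). The sketch below is a simplified model, not a benchmark: the function names are made up for this example, and the 4 ms latency is an assumed stand-in for "a few milliseconds".

```python
def full_read_seconds(capacity_bytes, throughput_bytes_per_s):
    """Time to read a whole drive once: capacity divided by throughput."""
    return capacity_bytes / throughput_bytes_per_s


def copy_seconds(n_files, total_bytes, latency_s, throughput_bytes_per_s):
    """Simplified copy model: one latency penalty per file, plus transfer time."""
    return n_files * latency_s + total_bytes / throughput_bytes_per_s


# Figures quoted in the lecture for the two drives.
ramac_1956 = full_read_seconds(5e6, 12.5e3)    # 400 seconds
drive_today = full_read_seconds(26e12, 250e6)  # 104000 seconds, about 29 hours
slowdown = drive_today / ramac_1956            # 260 times longer to read it all

# Assumed latency of 4 ms, standing in for "a few milliseconds".
LATENCY_S = 0.004
one_big_file = copy_seconds(1, 1e9, LATENCY_S, 250e6)            # about 4 seconds
many_small_files = copy_seconds(100_000, 1e9, LATENCY_S, 250e6)  # about 404 seconds

print(ramac_1956, drive_today, slowdown)
print(one_big_file, many_small_files)
```

Under these assumptions, the same gigabyte takes about a hundred times longer as 100,000 small files than as one large file, which is the latency effect behind the uneven progress bar, and reading a full modern drive takes 260 times longer than reading the full 1956 drive, because capacity grew much faster than throughput.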
It just doesn't work, it doesn't scale. We have a big problem, right? So this is why we have this discrepancy right here. So what do we do? We parallelize. So we are going to parallelize a lot in this lecture. We are going to parallelize. So imagine now you spread it all over the planet, every human being picks one and reads it. Now you can do it in 10 hours, right? So you solve that with parallelism. So we are going to be doing that with clusters of machines, and technologies on top of these clusters. All right. And the other one here, so of course, there is a discrepancy right here that we solve with parallelism. This one right here, the throughput versus latency, is also very important. This we solve with batch processing. Batch processing is the idea that instead of doing it one by one, you group them in thousands, for example, and you do it a thousand at a time. This is how you solve this problem here. But we'll come back over and over and over again during the entire semester to these two things. Do we have a question about the connection? Yeah, yeah. No, we really have to deal with that because there's something we need to solve with the Wi-Fi and the connection. Okay. So this is what I conclude about big data. It's all these technologies that help us store and analyze data, solving that issue of the discrepancy between capacity, throughput and latency. This is the problem that we are going to solve in this course. Right. Okay. So it's everywhere in the sciences. At CERN, these are enormous numbers for the number of collisions per second. They have 10,000 servers. They have hundreds of thousands of cores. It keeps growing all over the place. They produce 50 petabytes of data every year. They even throw a lot away. They don't even record everything. Right. But even what they keep is extremely large. The same goes for astronomy. We try to map the entire sky. Every single star we can find, or objects that we can find, and keep track of that.
This is also enormous. Right. Billions of objects with the Sloan Digital Sky Survey. The last phase ended in 2020 and now it's already continuing with the next phase. Right. So there is a paper, actually, I will give it to you as an optional read, that we actually contributed to in our group. That was in collaboration with physicists, actually, around some data. And we try to understand why people, in order to analyze the collisions of the particles, why are they not using SQL? Because SQL is actually perfect. It solves all of the data independence problems, makes it more convenient and so on. And we try to analyze why that is the case. And so there are interesting insights in there. And it's all related to the sort of things that we cover during the semester. That their data is partiness. The data frame like and so on and so on. But if you want to go ahead and read, of course, you can have a look at that. DNA is also the kind of data, with sequencing, that we analyze. So a very large number of pairs in our body. We are also making progress in gene editing and so on and so on. We can even store data on DNA; that was also done by some people formerly at ETH. They managed to store data as DNA pairs and execute relational queries on that, directly on DNA. And this is amazing because it shows again data independence. You can do it on your laptop, on a cluster, or on DNA, or on a clay tablet; that works too. It's a bit slower, but it works as well. All right. So we are almost at the break. I think we can do the break, maybe 15 minutes. And then I'm almost done with this introduction part. Then I'll tell you about the lecture scope. I'll introduce you to the TA team. Some of them are here with us. And then we'll move over to a section about, you know, a brush-up of SQL and relational databases. Right. So let's take a quick break, 15 minutes. And I'll see you at quarter past three for the continuation of the lecture. Thank you very much.
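The two remedies named earlier in this part of the lecture, parallelism for the capacity versus throughput gap and batch processing for the throughput versus latency gap, can be sketched with a toy cost model. This is a minimal sketch under idealized assumptions (work splits evenly, each batch pays the latency exactly once); all function names and numbers are hypothetical and chosen only to show the shape of the effect.

```python
import math


def parallel_read_hours(total_hours, workers):
    """Idealized parallelism: the work splits evenly, with no coordination cost."""
    return total_hours / workers


def request_time_seconds(items, batch_size, latency_s, per_item_s):
    """Each batch pays the latency once; each item pays its own transfer time."""
    batches = math.ceil(items / batch_size)
    return batches * latency_s + items * per_item_s


# Hypothetical numbers: 10,000 hours of reading spread over 1,000 readers.
spread_out = parallel_read_hours(total_hours=10_000, workers=1_000)
print(f"with 1,000 readers: {spread_out:.0f} hours")

# Hypothetical numbers: one million items, 4 ms latency, 10 microseconds per item.
one_by_one = request_time_seconds(1_000_000, 1, latency_s=0.004, per_item_s=0.00001)
batched = request_time_seconds(1_000_000, 1_000, latency_s=0.004, per_item_s=0.00001)
print(f"one by one: {one_by_one:.0f} s, in batches of 1,000: {batched:.0f} s")
```

In this model batching does not make any single item faster; it only reduces how many times the latency is paid, which is the sense in which the lecture says a thousand at a time solves the latency problem.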