Lecture 4: Data Wrangling (2020)

  • Published: Feb 10, 2025
  • You can find the lecture notes and exercises for this lecture at missing.csail....
    Help us caption & translate this video!
    amara.org/v/C1...

Comments • 114

  • @sriharshacv7760
    @sriharshacv7760 4 years ago +145

    I like the name 'missing semester'. It truly holds good for most of us.

  • @dr.mikeybee
    @dr.mikeybee 4 years ago +95

    Fun stuff! 30 years ago, when I was a UNIX admin, I had a book titled UNIX Powertools. I used it more often than any of my other books until the pages all fell apart. It covered a lot of things you cover in this course. I love this stuff.

    • @oscarzhang4734
      @oscarzhang4734 4 years ago +1

      if that's the case, probably I should remove that book from my wish list since it is kind of old.

  • @ryanleemartin7758
    @ryanleemartin7758 4 years ago +34

    This dude is a monster. His Rust streams are invaluable gems.

    • @harshteck
      @harshteck 3 years ago

      What’s his name ? And where can I access rust streams

    • @ryanleemartin7758
      @ryanleemartin7758 3 years ago +3

      @@harshteck His name is Jon Gjengset and his channel (same name) is right here on the youtubes!

    • @harshteck
      @harshteck 3 years ago +1

      @@ryanleemartin7758 thank you

  • @AnuragPandey-om6sp
    @AnuragPandey-om6sp 4 years ago +12

    Super fun to learn about such powerful pipelines that can be done just using terminal. Hats off. Best lecture I have ever watched. Goosebumps!! Love

  • @unclerojelio6320
    @unclerojelio6320 4 years ago +282

    I have a problem.
    I will solve the problem with a regular expression.
    Now I have two problems.

    • @kaushaltak007
      @kaushaltak007 4 years ago +2

      lol

    • @sodaumaru9291
      @sodaumaru9291 4 years ago +2

      lol, regular expression confuses me, too

    • @whatsmyname9742
      @whatsmyname9742 4 years ago +1

      that was nice xd xd

    • @wizard7314
      @wizard7314 4 years ago +3

      Very true. But the benefit is that eventually, after enough practise or ROTE learning, you know how to write the regex, and it is widely applicable to many problems.

  • @vvzen
    @vvzen 4 years ago +6

    I am incredibly grateful to this channel. It’s really changing the way I work and it has taken the joy of working in the command line to a whole new level!

  • @tri1033
    @tri1033 4 years ago +30

    Most productive way to spend my quarantine time :) , thanks a lot ❤️

  • @tomryannova
    @tomryannova 10 days ago

    This is a fantastic showcase of the different things that you can do in a cut/sed/grep/awk etc. streaming situation to solve a myriad of problems as the case may warrant. Love the inclusion of R, which in combination with awk seems like it could be used to detect outliers in a file/dataset and flag them with a QA warning. Bad data is an omnipresent danger in data flows and is worse than ever since a lot of data is coming from internet sources (URLs etc.). Some of the worst 'bad' data isn't outright formatting errors or NULLs but data that is valid but not quite right -- it is out of the expected/normal range but doesn't show up in red as a 0/NULL/data omission.

  • @manbingable
    @manbingable 5 years ago +40

    this is art. what a beautiful technology.

  • @xuefengwong9097
    @xuefengwong9097 4 years ago +2

    This is amazing! I think it's not only useful for students in schools but also veterans who have been working on linux for a long time.

  • @dralfonzo24
    @dralfonzo24 4 years ago +3

    Man, I wish I had done CS. Thanks for making this stuff available to us for free.

  • @verynoobcoder
    @verynoobcoder 1 year ago +1

    These lectures are fantastic, and Jon is particularly brilliant!

  • @dreadheadsic
    @dreadheadsic 5 years ago +56

    Amazing lecture. Thank you. Every software engineer should know this.

  • @LuisJimenezr01
    @LuisJimenezr01 4 years ago +15

    This is an extremely educational video, thank you so much for sharing this lecture.

  • @isaacvicente
    @isaacvicente 2 years ago +2

    At 18:00 he wrote a very long command that isn't really needed. Instead, he could use the cut command, which does precisely what he's trying to do: show us just what we want.
    A good example would be:
    cat ssh.log | cut -d ' ' -f4
    Where:
    -d option stands for "delimiter" (a pattern that separates words, for example)
    -f option stands for "field" -- think of a field as a column; in this example it would be the 4th column
    The cut command is really good when we are trying to read data that follows some kind of pattern.
    For instance, the /etc/passwd file on a Linux system uses ":" as the delimiter. So, the info is separated by a colon character.
    Obs.: Sorry about my English, just learning xD
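The cut usage described in this comment can be sketched as follows. The sample data and the variable name passwd_sample are made up for illustration; only the -d and -f options come from the comment. The last two commands show a caveat worth knowing: cut treats every single delimiter as a field boundary, which matters for logs padded with runs of spaces.

```shell
# Hypothetical sample in /etc/passwd format: name:password:UID:GID:comment:home:shell
passwd_sample='root:x:0:0:root:/root:/bin/sh
alice:x:1000:1000:Alice:/home/alice:/bin/bash'

# -d sets the delimiter, -f picks fields: here the login name (1) and the shell (7).
printf '%s\n' "$passwd_sample" | cut -d ':' -f 1,7

# Caveat: two spaces in a row produce an empty field for cut,
# while awk splits on runs of whitespace.
printf 'Jan  5 host\n' | cut -d ' ' -f 2    # prints an empty line
printf 'Jan  5 host\n' | awk '{print $2}'   # prints 5
```

So cut is a good fit when the delimiter is strict, as in /etc/passwd, and awk may be the safer choice when columns are aligned with variable whitespace.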

    • @veerendrasaraswathi
      @veerendrasaraswathi 11 months ago

      @isaacvicente can you tell me which terminal is used at the opening? Vim? Connected to GitHub via a Linux terminal?

  • @jadanabil8044
    @jadanabil8044 4 years ago +5

    Wow. Applause for the quality of content 👏👏👏

  • @jonathanengwall2777
    @jonathanengwall2777 4 years ago +3

    ed works very well. Learn ed, it is simple to learn in fact you have almost everything. Use "." to state the current line and return to command mode, "a" for after-use this to begin editing, and "w" for write-use this before "q" for quit, ed is great for building scripts. Also you need "p" print, for searching; "n" shows the line number. Quitting without writing will leave the file untouched, no matter the number of edits.

  • @jkyu2701
    @jkyu2701 4 years ago +5

    How I wish I could see this 4 or 5 years ago, when I started learning cs.

  • @tkmf3n
    @tkmf3n 3 years ago +1

    love this lecture! I was surprised at the very moment his camera capture appeared on screen

  • @knownsenseLabs
    @knownsenseLabs 4 years ago

    At 35:33 you can use sort's -u flag instead of sort | uniq, i.e. sort -u | ...
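A small sketch of the equivalence mentioned in this comment, using a made-up list of usernames (the variable users is hypothetical). It also shows where the two forms stop being interchangeable: counting duplicates still needs uniq.

```shell
# Hypothetical sample: one username per line, with repeats.
users='root
admin
root
guest
admin
root'

# Both pipelines print each distinct name once, in sorted order.
printf '%s\n' "$users" | sort | uniq
printf '%s\n' "$users" | sort -u

# Counting occurrences still needs uniq; sort -u has no counterpart to -c.
printf '%s\n' "$users" | sort | uniq -c | sort -nr
```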

  • @lkdhy-rc7ep
    @lkdhy-rc7ep 1 year ago

    really fascinating and helpful

  • @vaibhavksh
    @vaibhavksh 4 years ago +2

    Really helpful classes, glad that you shared it with us! 😀

  • @knight024
    @knight024 4 years ago +1

    One tip, if you're giving a lecture with regular expression, or even just bash/csh/tcsh shell - don't use a 'fancy' shell environment, especially one that ends in a nonstandard format of having a "|" pipe character at the end. It makes it confusing for the end-user and not necessary.

  • @hamedgholami261
    @hamedgholami261 1 year ago

    This session was gold. Thank you really.

  • @DataAnalyticsIreland
    @DataAnalyticsIreland 4 years ago

    This is a great video , in the middle of this process as part of machine learning video, thanks!

  • @aminebouaita9202
    @aminebouaita9202 4 years ago +1

    Great series of lectures, thank you

  • @samuelabreu4349
    @samuelabreu4349 4 years ago +4

    Incredible class. Thank you

  • @MichaelS-em8id
    @MichaelS-em8id 5 years ago +3

    the exercises for this lecture are kinda challenging for someone who is a complete beginner in regex. i know the first exercise they provided was a beginner regex tutorial, but the second question raised the roof on the difficulty.

    • @_chip
      @_chip 5 years ago +17

      It's MIT. The learning curve is steep.

  • @jeffreymagedanz8130
    @jeffreymagedanz8130 3 years ago

    In these examples, cat isn't necessary. You can just specify the file name as an argument to sed, and use one less process.
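The point about dropping cat can be checked with a throwaway file. This is a minimal sketch: the log lines, the sed expression, and the variable names are invented for illustration and are not the lecturer's actual data.

```shell
# Hypothetical log file created just for this sketch.
log=$(mktemp)
printf 'Disconnected from user alice\nDisconnected from user bob\n' > "$log"

# With cat: two processes connected by a pipe.
with_cat=$(cat "$log" | sed 's/.*user //')

# Without cat: sed opens the file itself.
without_cat=$(sed 's/.*user //' "$log")

# Input redirection also works for tools that only read standard input.
redirected=$(sed 's/.*user //' < "$log")

rm -f "$log"

printf '%s\n' "$with_cat"
[ "$with_cat" = "$without_cat" ] && [ "$with_cat" = "$redirected" ] && echo "same output"
```

All three forms produce the same text; the cat version mainly helps readability when a pipeline is built up step by step.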

  • @RawPeds
    @RawPeds 4 years ago

    Very cool. I knew about command lines utils like sed and awk, but never used because I thought they are complicated. Now they make a bit more sense. Thanks.
    By the way, this doesn't look like a lesson about data wrangling. More like: "how much stuff there is in shell and what can i do with it".

  • @oscarzhang4734
    @oscarzhang4734 4 years ago

    22:30 I saw lots of posts saying that sed doesn't support non-greedy matching, neither GNU nor BSD. Correct me if I'm wrong, but I think that non-greedy matching won't actually work on his machine. I tried the regex101 website; by default it uses PCRE mode. Is that why it works there?
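POSIX basic and extended regular expressions, which sed uses, define no lazy quantifiers, whereas Perl-compatible engines support forms such as .*?. A common workaround in sed is a negated character class that cannot cross the delimiter. The sketch below uses a made-up comma-separated line (the variable line is hypothetical).

```shell
line='alice,bob,carol'

# Greedy: .* runs to the LAST comma, so only the final field survives.
printf '%s\n' "$line" | sed -E 's/^.*,//'      # prints carol

# Lazy-style workaround: [^,]* cannot cross a comma, so the match
# stops at the FIRST one.
printf '%s\n' "$line" | sed -E 's/^[^,]*,//'   # prints bob,carol
```

The workaround only applies when the stopping point is a single character; for multi-character stop strings a Perl-compatible tool is usually the simpler route.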

  • @swaggy3987
    @swaggy3987 2 years ago

    this man is not a man but a god

  • @xenialxerous2441
    @xenialxerous2441 4 years ago +1

    Hey!! Super amazing topic, and super amazing session, thanks!! ;)

  • @danielghenghea7104
    @danielghenghea7104 3 years ago +1

    Awesome lecturer!

  • @daleowens7695
    @daleowens7695 4 years ago

    Wait, 13:13, regex operates "in place"? In other words, if a regex is to modify a string, the regex operates on the modified string as it's being processed? Understanding the behavior of this particular regex isn't terribly difficult, but the lecturer's wording made it sound like it behaves the way my question implies.

    • @daleowens7695
      @daleowens7695 4 years ago

      Actually, no this does not appear to be the behavior:
      ± |master ✓| → echo 'abcaabbcb' | sed -E 's/(ab|bc)//g'
      cab
      If this were the case I would expect ab to be removed as well, as once the inner 'abbc' is removed the remaining 'ab' _should_ be removed as well, correct? Could be a case of "peak behind" or something. I'd appreciate input from someone who knows regex better than myself, thanks.
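The output in this reply is consistent with how the g flag works: matches are found left to right in the original line, they do not overlap, and the result is not scanned again. A sketch of the difference, using sed's standard label and t (branch if a substitution was made) commands to repeat the substitution until nothing matches:

```shell
# One pass: the text left behind after the removals is not rescanned.
echo 'abcaabbcb' | sed -E 's/(ab|bc)//g'                      # prints cab

# Repeated passes: "t" jumps back to the label "a" whenever the
# preceding s command changed the line, so the leftover "ab" goes too.
echo 'abcaabbcb' | sed -E -e ':a' -e 's/(ab|bc)//g' -e 'ta'   # prints c
```

In the first command the matches are ab at the start, then ab and bc in the middle, leaving c, a and b, which only form "ab" after the removals have already happened.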

  • @JinayShah
    @JinayShah 4 years ago +61

    So this is why people go to MIT...

    • @ezio934
      @ezio934 4 years ago +13

      A lot of Linux users excluding ubuntu know these tools.

    • @bossebo3535
      @bossebo3535 4 years ago +3

      Yeah, and anyone doing sysadmin stuff for Linux should really learn these tools. At least awk, sed, grep (-P for Perl regex) and probably cut as well

    • @anonymoususer5402
      @anonymoususer5402 4 years ago +2

      @@ezio934 Ubuntu trolls are everywhere 😂, but after using arch I also love arch. But please don't troll.

  • @71sephiroth
    @71sephiroth 2 years ago

    At [12:00] would it be the same result if we wrote: echo 'abcaba' | sed "s/ab//g"?

    • @enisten
      @enisten 1 year ago

      I think so. If we had a repeating pattern of 'ab's (e.g. abababababc), the asterisk would replace all of them at once with nothing, as opposed to one 'ab' at a time.
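The answer above can be checked directly. In this sketch the X replacement is only a visible marker added for illustration, and (ab)+ is used for the marker lines instead of (ab)* so that empty matches do not clutter the output; both choices are assumptions of the sketch, not something shown in the lecture.

```shell
# The exact question: both commands leave the same text behind.
echo 'abcaba' | sed -E 's/(ab)*//g'     # prints ca
echo 'abcaba' | sed 's/ab//g'           # prints ca

# A visible marker shows how many matches each pattern makes.
echo 'abababc' | sed -E 's/(ab)+/X/g'   # prints Xc   (one match covering all three)
echo 'abababc' | sed -E 's/ab/X/g'      # prints XXXc (three separate matches)
```

With an empty replacement the number of matches does not matter, which is why the two commands in the question agree.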

  • @胡德顺-h1s
    @胡德顺-h1s 3 years ago

    very wonderful lectures

  • @DutchmanDavid
    @DutchmanDavid 5 years ago +3

    Protip: If you want to *really* learn Regex, take some kind of parsing course.
    Regexes are powerful enough that you could write a parser in it (because it's a state machine, behind the curtains - though this will likely make more sense after learning how to parse)
    I'm a dirty Windows pleb, but I can still use regex via *visual studio code* to wrangle textual data. It's great.

    • @kubic71
      @kubic71 4 years ago +1

      Regexes correspond to finite state machines and cannot parse recursive structures, you need some kind of grammar to do that (and tool like GNU Bison)

    • @Marshblocker
      @Marshblocker 1 year ago

      A parser parses context-free grammars, which cannot be implemented using regular expressions since regex can only process a regular language (CFGs are more expressive than regular languages). We use regex in creating a token generator i.e. a lexical analyzer.

  • @AdamS-lo9mr
    @AdamS-lo9mr 7 months ago

    At my school they do this sort of stuff in a lab session for a semester.

  • @AkhilendraGadde
    @AkhilendraGadde 4 years ago +5

    46:13 to 49:08 literally freaks me out!

    • @ezio934
      @ezio934 4 years ago

      That's actually very common for Linux users.

  • @pbezunartea
    @pbezunartea 4 years ago +8

    Nice lecture! I miss the professor repeating the question asked. It's hard to hear.

    • @hectorcanizales5900
      @hectorcanizales5900 3 years ago +1

      Yeah, I too was flattered that he cares for us internet consumers

  • @patrickren7395
    @patrickren7395 4 years ago +11

    MIT quality, huh?

  • @iLlamas1
    @iLlamas1 3 years ago

    how do we give syntax highlight to the terminal? their terminal looks awesome

  • @pwndumb2903
    @pwndumb2903 4 years ago +1

    Thanks for the lecture. Can you tell me which plugin colorizes the command-line arguments?

    • @vappolinario
      @vappolinario 4 years ago +1

      github.com/zsh-users/zsh-syntax-highlighting

  • @imjavierpalma
    @imjavierpalma 2 years ago

    Great lecture!

  • @MrKsuhiyp
    @MrKsuhiyp 4 years ago +3

    04:24 I don't know why no one in the audience applauds and shouts WOW YEAH like they do at Apple conferences!

  • @utkarshmaurya6877
    @utkarshmaurya6877 3 years ago

    Can anyone tell me how his commands show up in red until he completes them??

  • @infavorofdemocracy5770
    @infavorofdemocracy5770 4 years ago +1

    The regex at 27:10 scared the ****** out of me

  • @RenaudAlly
    @RenaudAlly 2 years ago

    Holy fuck. That completely blew me away. My god.

  • @mohamedaminebenmabrouk
    @mohamedaminebenmabrouk 4 years ago

    Thank you so much !! inspiring ❤

  • @trunc8
    @trunc8 4 years ago +2

    Gold!

  • @josephwong2832
    @josephwong2832 4 years ago

    It seems Regex > SQL, because you get much of the same functionality and don't have to construct/manipulate crazy complicated tables, so long as you structure your files correctly (in the regex case)???

  • @link8649
    @link8649 3 years ago

    Thank you very much.

  • @himanshutank5477
    @himanshutank5477 4 years ago

    which terminal you are using? zsh, fish? where do we look for to get exact terminal settings like you?

    • @johngibson4874
      @johngibson4874 3 years ago

      I think he said in the lecture that he was using fish

  • @김주혜-b8g
    @김주혜-b8g 4 years ago

    thank you for sharing nice lecture

  • @NoEgg4u
    @NoEgg4u 4 years ago +3

    People are commenting on how helpful this lecture was. Helpful how?
    If you already know the commands, syntax, options, shell substitution, etc, then you already know this stuff.
    If you are learning Linux (and I imagine that the students are there to learn Linux), are they really supposed to retain the blizzard of commands, brackets, braces, pipes, etc, from watching someone whip through several examples?
    At the end of his lecture (at 49:44), he asks "Any questions about what we've covered so far?"
    Not one question from anyone in the classroom.
    No one asking a question is not a sign that they understood. Rather, it is a sign that they are all completely lost.
    This instructor's goal was to show off his Linux skills; not impart knowledge to his students.
    Be honest. If you were in that classroom, would you have retained anything useful?
    Would you be able to sit at the terminal and repeat even 10% of his examples?
    The point of taking a class is to learn the subject matter; not to sit in awe of the instructor's wizardry.
    I would rather learn and retain 25% of the topics, than fumble through 100% of the topics.

    • @prasannarajaram
      @prasannarajaram 4 years ago +8

      @Perhaps, you are mistaken. The lecturer is not showing off his skills. He is showing what possibilities lie ahead with the tools at hand in Linux. The purpose of a lecture is not to help you memorize course content, but to show you what is available for you to explore.
      "Education is not the learning of facts, but the training of the mind to think," as Albert Einstein said.
      When you know there are tools to do these things, you can think of creative ways to use them in your favour

    • @NoEgg4u
      @NoEgg4u 4 years ago

      @@prasannarajaram If I were paying tens of thousands of $$ to learn a skill, then I want to learn the skill. I do not want to leave, asking myself:
      "What did I just sit through?"
      "Did I pay big $$ for that?"
      "Is there a next step where I will be taught to use those commands?" (it was not mentioned in the video).
      When I went to school, every class taught me a tangible skill. I was asked questions, because the instructor took an interest in ensuring that the pupils understood her teachings. After listening to the teacher, I worked on actual examples, so that my brain performed the actions and processed the material, so that it would stick with me. The instructor did not jump from chapter to chapter at breakneck speed.
      It is fine for the instructor to show some of the advanced items, and give a general introduction to what it entails. But then he needs to break down each part, and spend some time on each part; giving the students time to absorb the material.
      The name of this class was not "Overview of the power of Linux".
      This class did not expand the mind. It spun it in circles. He might as well have tossed in some python and some perl scripting. They wrangle data, too.
      No one needs to pay the cost of a new car to hear and watch his lecture.
      The same "look what you can do with Linux" content can be found in countless, free on-line videos, found all over social media.

    • @fedemoreno613
      @fedemoreno613 4 years ago +1

      @Perhaps Are you worried for what this students pay? Are you the controller of the faire price? Let anyone judge by themselves that. But if you are worried them there is more important good causes I'm a poor guy ok Argentina :) read the news of my country. The videos are a good first approach to the subject. thanks the people who pay that and the people who take the money and share that to all of us

    • @NoEgg4u
      @NoEgg4u 4 years ago +3

      @@fedemoreno613 .
      -- Am I worried for what students pay?
      I never gave it a thought, any more than I thought about how much students pay for transportation, clothing, nutrition, etc. I am puzzled by why you asked that question.
      Are you asking because you worry, and you want to compare your worries with me?
      -- Am I the controller of the faire price?
      I do not understand your question.
      -- Let anyone judge by themselves that.
      Judge what by themselves?
      Whatever they are judging is fine with me. Why would you think it (whatever "it" is) would not be fine with me? I don't understand why you would ask?
      I am also a poor guy in Argentina.
      We live in a small world.

    • @fedemoreno613
      @fedemoreno613 4 years ago

      @@NoEgg4u "fair price" i refer to the answer you gave to Prassana, What a small world!, may be we are a block from each other. it was 2am and I could not sleep and this video was useful to me and then I read a comment with the thumb down. Your comment and you said "No one needs to pay the cost of a new car to hear and watch his lecture", and i said to you "Laissez-faire"

  • @chaselee5440
    @chaselee5440 4 years ago +2

    what font is that? I really like it!

    • @sketchmaster23
      @sketchmaster23 4 years ago +1

      @Bloatman McEmacs Looks like input or firacode. Inconsolata is a sans font. EDIT: So I did some digging because I was also interested in the font haha, and he says in one of his videos from his channel that it's Noto Sans Mono (Noto Mono?)

    • @iduran
      @iduran 4 years ago

      In his dotfiles, the alacritty config, it says it is Noto Sans Mono. github.com/jonhoo/configs/blob/master/gui/.config/alacritty/alacritty.yml

  • @beanlighter9491
    @beanlighter9491 3 years ago +1

    Computers are beautiful

  • @marvin674
    @marvin674 4 years ago

    I'm trying the exercises at the moment and I can't get it working (been trying for some hours now). Are there solutions anywhere? I'm on #2

    • @marvin674
      @marvin674 4 years ago

      I figured out a RegEx to find the words that match the criteria: (.*(a|A).*){3}.*[^s]$ but how do I search for these specific words?
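One way to run a pattern like this over a word list is grep -E, which prints every line that matches. The sketch below is an assumption-laden illustration, not an official solution: the word list is made up, the pattern is simplified to (.*a){3} with -i for case-insensitivity, and the "'s" ending is filtered in a second grep. Note that the [^s]$ in the comment rejects every word ending in s, which may be stricter than intended.

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' banana "banana's" cat Alabama data)

# Step 1: keep words with at least three a's (-i ignores case).
# Step 2: drop words ending in 's (-v inverts the match).
printf '%s\n' "$words" | grep -iE '(.*a){3}' | grep -v "'s\$"
```

Against a real dictionary file, the same two grep commands can take the file name as an argument instead of reading from printf.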

  • @kapilrakh
    @kapilrakh 4 years ago +1

    I noticed you have a user called "kodi" 29:07

  • @daksh6752
    @daksh6752 4 years ago +8

    cat ssh.log | less?? Why not less ssh.log?

    • @ekkotron
      @ekkotron 4 years ago +2

      The cat command is the standard way to turn one or more files into a standard input stream that can then be followed by any number of piped commands.
      If you work with piped command lines a lot, "cat input-file" is already typed automatically before you start wondering what you shall do next with it...

    • @anirangoncalvesbr
      @anirangoncalvesbr 4 years ago +1

      Daksh is right, but just as ekkotron said and I agree: force of habit.

    • @ezio934
      @ezio934 4 years ago

      It's meant for normies. Most intermediate Linux users know this.

    • @soksamnang2150
      @soksamnang2150 4 years ago +1

      it just like first time you learn how to declare var. int a; a = 10;

  • @zalibecquerel3463
    @zalibecquerel3463 3 years ago +1

    Sweet Jesus! Cat piped to a sed regex piped to sort piped to uniq piped to awk.... PIPED TO R!
    Cool!

  • @samuel91222
    @samuel91222 3 years ago

    I'd have clicked Like 100 times if I can

  • @steveroger4570
    @steveroger4570 4 years ago +7

    Others keep talking about the new trend of data mining and data science with Python, while this guy did all this just by piping a bunch of bash commands together

    • @user-sd2en6pn3z
      @user-sd2en6pn3z 4 years ago +2

      You should check out Jeroen Janssens' "Data Science at the Command Line" www.datascienceatthecommandline.com/ He had a YouTube video back when he was still working on the book but I can't find it right now.

    • @abhijitubale
      @abhijitubale 4 years ago

      @@user-sd2en6pn3z thanks for sharing...this is gold.

  • @fhajji
    @fhajji 4 years ago

    Basically, this is sysadmin 101.

  • @JohnMc5
    @JohnMc5 4 years ago +1

    Can't you just import the huge file into Python and use a normal programming language and editor rather than typing it into the cmd line?

    • @cabdolla
      @cabdolla 4 years ago +4

      The tools in Linux are extraordinarily powerful. Don’t let the command line simplicity deceive you.

  • @rosgori
    @rosgori 4 years ago +1

    Weird, I don't see the option -E in the manual for sed.

    • @felixh1121
      @felixh1121 4 years ago +3

      There are different versions of sed implementations. the default on Mac does have -E option.

  • @thefazledyn
    @thefazledyn 3 years ago

    Things that my school didn't teach me

  • @padreigh
    @padreigh 4 years ago

    Just a resource for learning/testing your regex knowledge: regexcrossword.com/ - I am currently stuck at the Shakespearean ones because I lack the knowledge / English is a 2nd language for me ;)

  • @rohitdhankar9100
    @rohitdhankar9100 4 years ago +2

    lots of cat abuse

  • @oleksandrkachynskyi1013
    @oleksandrkachynskyi1013 4 years ago +2

    Amazing lecture. Thank you.