The Anonymisation Problem - Computerphile

  • Published: 4 Dec 2017
  • Keeping data anonymous seems easy, but keeping identities separate is a big problem. Professor Derek McAuley explains.
    EXTRA BITS: • EXTRA BITS: More Probl...
    This video was filmed and edited by Sean Riley.
    Computer Science at the University of Nottingham: bit.ly/nottscomputer
    Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan.com

Comments • 145

  • @PavelJanata 6 years ago +168

    I remember hearing about an anonymous form in some computer science class.
    First question: Are you a male or female?
    There was just one female in that class

    • @anonymouse7074 6 years ago +4

      Lmao

    • @tacticallala8788 6 years ago +8

      WRONG, it was a XIR !!

    • @Lorkin32 6 years ago +10

      This is equally sad and informative! Impressive!

    • @KuraIthys 6 years ago +1

      Ouch. XD - I know that feeling... >__<
      But yeah... 'Anonymous' Sure. XD

    • @richardtickler8555 6 years ago +2

      We had to fill out an anonymous survey on our first day at uni. There were about 100 people in my course and they asked for birthday and home town.
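
A survey like the ones described in this thread can be checked for this kind of leak before anyone sees the results. The following is a minimal Python sketch, assuming hypothetical field names (`birthday`, `home_town`) and made-up rows; it lists respondents who are the only ones with a given combination of answers.

```python
from collections import Counter

def unique_respondents(rows, quasi_identifiers):
    """Return rows whose combination of quasi-identifier values appears only once."""
    def key(row):
        return tuple(row[field] for field in quasi_identifiers)

    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] == 1]

# Made-up survey rows; the field names are assumptions for illustration.
rows = [
    {"birthday": "03-14", "home_town": "Nottingham", "answer": "yes"},
    {"birthday": "03-14", "home_town": "Nottingham", "answer": "no"},
    {"birthday": "07-01", "home_town": "Leicester", "answer": "no"},
]

# Anyone returned here can be singled out by birthday plus home town alone.
exposed = unique_respondents(rows, ["birthday", "home_town"])
```

If `exposed` is not empty, the "anonymous" form identifies at least one person to anyone who knows those two facts about them.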

  • @trejkaz 6 years ago +13

    They tried to do an "anonymous" survey at work, but:
    1. They had a compulsory question where they asked for your team and manager.
    2. The email sent around had a unique ID on the link to the survey.

  • @sinefield5425 6 years ago +34

    Yeah I see the problem with anonymisation. There's no use pixelating the guy's face if it's just going to show it right next to it in the thumbnail

  • @rentzepopoulos 6 years ago +3

    For a few videos I feel that being allowed to like just once is not enough; this was one of them.

  • @puellanivis 6 years ago +9

    Even with “perfectly” anonymized aggregated data, later releases of that same data can cause individuals to be identifiable from the data if the populations have only slightly altered. Basically, it’s the same kind of math used in interferometers.
    So, let’s say you have “perfectly” anonymized aggregated data about the viewing habits from your whole street over a period of time. Knowing when people leave and enter the group, you can use interference patterns to extrapolate out individuals.
    But here’s one of the cool things that some people have figured out: you can add statistical noise to the data. If you have each individual data point randomly be a lie, then at a certain level of aggregation the statistical noise can be filtered out of the aggregate data, but the narrower you attempt to get, the more this statistical noise makes the data untrustworthy.
    So, let’s say that we have the video rental/watches, but each person has a random selection of ~25% of their watches replaced with a completely random selection from the available data. Even if you can deanonymize an individual into a single row, the data you have on their watching habits is fundamentally untrustworthy, because 25% of them are false. You can’t point to a movie and say, “look, they watched this movie!” Because that could have been a lie. But, aggregate the data together, to say 10,000 people, and now the data on most-watched movies is still clear, because the statistical noise can be filtered out.
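
The noise idea in this comment can be simulated. The sketch below is only an illustration under assumed numbers: a hypothetical catalogue of 50 titles with a skewed popularity, 10,000 viewers with 20 watches each, and the comment's 25% replacement rate. The function names are made up for this example.

```python
import random
from collections import Counter

def add_noise(watches, catalog, rng, p=0.25):
    """Replace each watch with a uniformly random title with probability p."""
    return [rng.choice(catalog) if rng.random() < p else w for w in watches]

def estimate_true_count(observed, total, n_titles, p=0.25):
    """Undo the expected effect of the noise on one title's aggregate count.

    Expected observed count = (1 - p) * true + p * total / n_titles.
    """
    return (observed - p * total / n_titles) / (1 - p)

rng = random.Random(0)
catalog = [f"movie_{i}" for i in range(50)]      # hypothetical titles
weights = [1 / (i + 1) for i in range(50)]        # assumed skewed popularity
people = [rng.choices(catalog, weights=weights, k=20) for _ in range(10_000)]
noisy_people = [add_noise(watches, catalog, rng) for watches in people]

true_counts = Counter(t for watches in people for t in watches)
noisy_counts = Counter(t for watches in noisy_people for t in watches)
total = sum(noisy_counts.values())

top_true = true_counts.most_common(1)[0][0]
top_noisy = noisy_counts.most_common(1)[0][0]
estimate = estimate_true_count(noisy_counts[top_true], total, len(catalog))
```

Any single row in `noisy_people` contains entries that may be false, yet the most-watched title and a good estimate of its true count survive in the aggregate, which is the trade-off the comment describes.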

  • @fetchstixRHD 6 years ago

    It’s beautiful watching these when they are topics that appear in my lectures(!)

  • @igniculus_ 6 years ago

    I never miss a Computerphile video

  • @Slarti 6 years ago +1

    When I worked as a data analyst with medical data there were certain rules, including that you could not anonymise any data for a patient who had a condition that was below a certain prevalence within their general geographical area. This was so that it was not possible to trace individuals through their conditions.
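
One possible reading of the rule described above is small-cell suppression: records are withheld when too few people in an area share a condition. The Python sketch below assumes that reading; the field names, the threshold of 5, and the sample records are all hypothetical.

```python
from collections import Counter

def suppress_rare_conditions(records, min_count=5):
    """Drop records whose (area, condition) pair occurs fewer than min_count times."""
    counts = Counter((r["area"], r["condition"]) for r in records)
    return [r for r in records if counts[(r["area"], r["condition"])] >= min_count]

# Made-up data: six common cases and one rare case in the same area.
records = (
    [{"area": "NG7", "condition": "asthma"}] * 6
    + [{"area": "NG7", "condition": "rare_condition_x"}]
)

# The single rare record is withheld because it would point at one person.
released = suppress_rare_conditions(records)
```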

  • @giveussomevodka 6 years ago +14

    Can you guys do something on the Byzantine generals problem?
    It's trendy now with cryptocurrency and blockchains, but it was always interesting.

  • @rikwisselink-bijker 6 years ago +4

    He must have been so proud of his son :)

  • @scottbeard9603 6 years ago +2

    A video on the EU’s new General Data Protection Regulation (GDPR) would be incredibly interesting/useful!

  • @victornpb 6 years ago +6

    If he's the only one that filled in a bogus zip code, you can still identify him...

    • @grn1 3 years ago

      Not if he chose one of the codes with a lot of people in them.

  • @ButzPunk 6 years ago +6

    Never realised that UK postcodes say the street number. In Australia, the postcode just tells you the (group of) suburb(s) (or larger region, if you're in the country) that you live in. I can see why UK postcodes are so long now.

    • @BritishBeachcomber 6 years ago +1

      Bluelightzero or just one in my case. I have my own personal postcode!

    • @iAmTheSquidThing 6 years ago

      I believe the first three characters are the region, and the last three characters are the street.

    • @BritishBeachcomber 6 years ago

      Each postcode consists of between two and four characters, followed by a space, followed by another three characters.
      The first set of characters is the outcode (sometimes known as the outward code) whilst the second set is the incode (sometimes known as the inward code).
      These are used to direct the mail first to a regional sorting office, then to the local destination.

    • @syphon47 6 years ago +2

      The first portion, which is a letter (or 2), is called the Post Area (B = Birmingham, LE = Leicester). Including the numbers before the space gives the Post District, which is more granular. If you then include the first number after the space you have the Post Sector, which is a small region of a few hundred streets (Post Sectors vary in size).
      Oh, and postcodes also officially have an extra 2 characters at the end, called the delivery point suffix (DPS), which basically identifies the letter box. Used for multiple residences within one house number, I think.
      It's all very fascinating... :-|
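
The structure described in this thread can be sketched in Python. This is a rough split based only on the commenters' description (area, district, sector), not a validator for real postcodes; the function name and the example postcode are illustrative assumptions.

```python
import re

def split_postcode(postcode):
    """Split a UK-style postcode into progressively finer geographic parts."""
    outcode, incode = postcode.strip().upper().split()
    area = re.match(r"[A-Z]+", outcode).group()  # leading letters, e.g. "LE"
    return {
        "area": area,                         # post area
        "district": outcode,                  # everything before the space
        "sector": f"{outcode} {incode[0]}",   # plus first character after the space
        "full": f"{outcode} {incode}",
    }

parts = split_postcode("le1 7rh")
```

Publishing only the `district` or `sector` instead of the `full` value is one way to coarsen location so that more people share each value.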

  • @2Cerealbox 6 years ago +4

    I've been railing against this for years! Thank god someone more visible than me has something to say about it.

  • @willemkossen 6 years ago +1

    Very good video. I'd love to chat a bit more about these topics with this man.

  • @KaktitsMartins 6 years ago +3

    "people tend to live somewhere"

  • @ivarwind 6 years ago +1

    The problem with a bogus post code, of course, is that if all the students fill out the form, the one with the bogus post code comes from the student whose post code is missing from the data.
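
The elimination argument above amounts to a set difference between a known roster and the submitted answers. A minimal Python sketch, assuming a hypothetical roster that maps student names to their real post codes (all values made up):

```python
from collections import Counter

def unmatched_students(roster, survey_postcodes):
    """Return students whose real post code is not accounted for in the survey."""
    remaining = Counter(survey_postcodes)
    unmatched = []
    for name, postcode in roster.items():
        if remaining[postcode] > 0:
            remaining[postcode] -= 1   # this answer plausibly belongs to them
        else:
            unmatched.append(name)     # nobody submitted their post code
    return unmatched

# Hypothetical class list held by the school.
roster = {"student_a": "NG1 1AA", "student_b": "NG2 2BB", "student_c": "NG3 3CC"}
# Survey answers: one of them is a bogus post code.
survey = ["NG1 1AA", "NG3 3CC", "ZZ9 9ZZ"]

suspects = unmatched_students(roster, survey)
```

With a single bogus entry, exactly one student is left over, so the bogus answer protects nobody unless several people submit one.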

  • @AnimeReference 6 years ago +2

    How many post codes do you have? In Australia it is roughly one per suburb (sometimes one per two suburbs) so most of the students have the same post code as the school, otherwise will be from surrounding suburbs at most two away (with a decent population in each unless the school is tiny).

  • @asireprimad 2 months ago

    How about a follow up video on differential privacy and statistical disclosure control?

  • @TorgieMadison 6 years ago

    You should do a video on the courier / generation / uses of OTPs. There's a whole world of intrigue in how OTPs are both so simple, but so impenetrable to hacking.

  • @Super_Cool_Guy 6 years ago +1

    That makes great sense!

  • @BigDBrian 6 years ago +5

    If I may humbly suggest you alter the title so it doesn't appear to suggest that anonymity is the problem, that would be great.
    After all, the video is about the opposite. Suggestions: The (re)identification problem; The deanonymisation problem.
    Just to be clear - it's a suggestion and not a demand.

    • @oktw6969 6 years ago

      So you suggest changing a title based purely on the form of political correctness? It is called that way because retaining anonymity on complex data structures becomes a problem.

    • @leftaroundabout 6 years ago +2

      In CS, the word “problem” does not have any negative connotation. E.g. the Travelling Salesman problem doesn't discuss how to get rid of the salesman, it discusses a goal the salesman is pursuing and _the problem she's experiencing_ in trying to get there.
      Likewise, the anonymisation problem is the research subject where we try to achieve anonymity. “The re-identification problem”, conversely, would be an internal video an intelligence agency might produce while trying to break that anonymity...

  • @robertdanielpickard 6 years ago

    Great topic!

  • @fkhg1 6 years ago

    Does anyone know if the person in the background at 9:45 caught his bus or missed it?

  • @BlenderDumbass 6 years ago +1

    The point is you have to remove everything uniquely identifying and use a lot of false data to confuse any algorithm.

  • @tedchirvasiu 6 years ago +8

    staticksticks

  • @Flankymanga 6 years ago

    This is exactly why I thought not twice but four times about what to fill in on the form when there was a national citizen recount in my country that was reported to be anonymous...

  • @justin_5631 6 years ago +2

    Just noticed this guy works outside a giant lego pyramid.

    • @oclipa 6 years ago +2

      Justin _ Actually, all Computerphile videos are created in minecraft, but it is not usually this obvious.

    • @justin_5631 6 years ago +1

      I could correlate this anonymous video with the number of giant minecraft pyramids in the world to discover where the videos are being made.

  • @KipIngram 3 months ago

    De-anonymizing someone in a situation where anonymity was clearly promised (like in the speaker's son's post code situation) should be a criminal offense with substantial jail time associated with it.

  • @jwenting 6 years ago

    I've had more than a few "anonymous surveys" that were sent using personalised links... I tend to not answer such surveys, they're clearly not anonymous.

  • @cpt_nordbart 6 years ago

    What about decensoring? I've heard about cases where blacked-out names on some documents were reconstructable.

    • @SuviTuuliAllan 6 years ago

      Start using white ink. Problem solved!

  • @linawhatevs8389 6 years ago

    There IS completely bulletproof cryptography: the One Time Pad.
    Something as simple as limiting the output to something like 128 bits should be enough to remove any hope of deanonymizing a gigabyte-sized database.

  • @MrSonny6155 6 years ago

    When you realise that you can now watch computer nerds on Computerphile in 4K. Too bad my internet speed is a meme.

  • @DerkvanL 6 years ago +1

    Your extra bits link is not available.

    • @Computerphile 6 years ago +1

      +DerkvanL thanks for the spot, should be there now >Sean

    • @DerkvanL 6 years ago

      Computerphile thx, a very interesting topic! Watched it ;)

  • @SuviTuuliAllan 6 years ago +1

    How about a video on CJDNS, Hyperboria, and all that other mesh nonsense? Or did you make a video like that already? Well, in any case, get to the details then.

  • @elliot9507 6 years ago

    Please enable translation; the video looks very interesting, but unfortunately I can only understand about 1/3 of what he says.

  • @sciverzero8197 6 years ago

    I really wish google would let me have unlinked 'slightly less nonymous' accounts for things ... >.>

  • @SuviTuuliAllan 6 years ago

    So what was his son's name and shoe size?

  • @igorbednarski8048 6 years ago +8

    While it is true that any cipher can be broken given enough time, at a certain level it is not 'a-supercomputer-would-need-a-few-years-level' difficult; it becomes 'the-sun-will-burn-out-even-if-you-had-a-planet-sized-quantum-computer-level' difficult.

    • @lordcirth 6 years ago +5

      But only if you don't count side-channel attacks. That's how crypto really gets broken.

    • @RnBandCrunk 6 years ago +1

      Igor Bednarski a planet sized quantum computer could easily solve all the cryptography known now in milliseconds

    • @masansr 6 years ago +2

      I could generate a million-character key in a moment; a supercomputer would need (possible characters)^(10^6) actions to crack that. Let's say I only use English lowercase letters (although there is no reason to limit yourself like that). That's 26^1000000 actions, or roughly 2.23x10^1414973. It's estimated that there are 10^50 atoms on Earth. Let's say every atom could perform a calculation every 5x10^-44 seconds (Planck time). Earth would be a computer with a frequency of 2x10^93 calculations per second. That's roughly 10^1414880 seconds to crack the code, which is 10^1414780 times longer than the heat death of the Universe.
      Of course, you could get lucky and solve it in, let's say, first 0,05% of guesses, but it would still be long past heat death.
      (Every calculation done by Wolfram Alpha)
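
The orders of magnitude in this estimate can be recomputed from the stated inputs. The Python sketch below works in base-10 logarithms because the numbers are far too large for floating point; the inputs (26 letters, 10^6 characters, 10^50 atoms, one operation per 5x10^-44 seconds) are the comment's own assumptions, not measured facts.

```python
import math

key_length = 1_000_000      # characters in the key
alphabet_size = 26          # lowercase English letters
atoms_on_earth = 1e50       # rough estimate quoted in the comment
seconds_per_op = 5e-44      # approximately the Planck time

# log10 of the number of possible keys: 10^6 * log10(26)
log10_keys = key_length * math.log10(alphabet_size)

# log10 of operations per second if every atom computes once per Planck time
log10_rate = math.log10(atoms_on_earth / seconds_per_op)

# log10 of the seconds needed to try every key
log10_seconds = log10_keys - log10_rate

mantissa = 10 ** (log10_keys - math.floor(log10_keys))
print(f"keys    ~ {mantissa:.2f}e{math.floor(log10_keys)}")
print(f"rate    ~ 1e{log10_rate:.1f} operations per second")
print(f"seconds ~ 1e{log10_seconds:.0f}")
```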

    • @lordcirth 6 years ago +7

      masansr Yup. And the NSA would just exploit a bug in your browser, root your machine, and steal the key.

    • @masansr 6 years ago +1

      Well, if you had Windows 10, they could just ask Microsoft for access to the computer, no need to publicise another bug. That's the problem with such keys - since there is no algorithm, you have to have a copy of it somewhere. But they cannot be cracked.

  • @Interpause 6 years ago

    YET I NEED ALL THE DATA I CAN GET

  • @PsychoticusRex 6 years ago

    3 Cheers for someone up-talking OAS! XD

  • @froozynoobfan 6 years ago

    What if you scramble the column indexes of each column randomly and of course minimize/remove any sensitive personal data with cryptography (strong enough key)?

    • @iAmTheSquidThing 6 years ago

      Then the data wouldn't be much use, because you wouldn't be able to find correlations between two different variables.

  • @pnedkov 6 years ago +1

    If his son is the only person who filled in a bogus post code, he is busted. They can rule out the people who filled in their actual post code and can be identified. And let's not forget his father is a professor in that field. How do you protect yourself against that?

    • @oktw6969 6 years ago

      By not having your state intelligence agency run purely through negative selection.

  • @Baigle1 6 years ago +20

    and Microsoft says that all their keylogging is anonymized lol

    • @tacticallala8788 6 years ago +4

      Windows 10 is NS/\ sbywear.

    • @Baigle1 6 years ago +3

      More like Redmond spyware. All that telemetry crap is now hidden away in the kernel. No escaping it, just use a different OS.

    • @Hudgi34 6 years ago

      yeah just make your own OS

    • @tacticallala8788 6 years ago +1

      667Atlas Everything is logged, not only your keystrokes. When one day the NS/\ want to see if you're a danger they'll want to see all your traffic and all your diik pics.

    • @Baigle1 6 years ago +1

      Yes, keylogging. Otherwise "typing and handwriting data" by their terms. By default it is on, and it is one of nearly a hundred or more sources of telemetry data from Windows 10 machines.
      There are court cases going on that consider all data acquired by 3rd parties (Historical Phone Location Records in that case) to be witnesses to a crime, but the impact of your lack of privacy doesn't stop at criminal cases. It's very profitable to know as much about you as possible, and no database software is invulnerable.

  • @motyakskellington7723 6 years ago

    Post-quantum cryptography

  • @exponentmantissa5598 6 years ago

    Run TAILS all the time and use pseudonyms and aliases.

  • @Cambesa 6 years ago

    Would using AES-256-CBC and double encryption help anonymizing users? I'm thinking of ways to anonymise users in a database

    • @tacticallala8788 6 years ago +1

      I learned to salt and encrypt at least 1000 times for secret data like passwords but perhaps it should be done for everything, except searching the db would be a pain.

    • @johnfrancisdoe1563 6 years ago +2

      Cambesa You're not getting the point. This is about getting rid of the personal identity *permanently*. As in deleting it or not getting it in the first place. It's not about protecting the data you do keep.

  • @marcgrec7814 6 years ago

    XD

  • @hattrickster33 6 years ago

    Travel back in time and tell Turing you can crack Enigma in seconds =p

    • @voidvector 6 years ago

      Just bring back a bag full of laptops with power adapters. Given the amount of basic spreadsheet calculations you can do on them (e.g. ballistics, crypto, linear/non-linear optimization, Monte Carlo sims), it would probably straight up win the war for whichever side gets it.

  • @judgesmicheal2096 6 years ago

    "Anonymization" is spelled with a "z" not a "s".

    • @Computerphile 6 years ago

      Depending where you come from.... >Sean

  • @geoffhalsey2184 6 years ago

    Doesn't a VPN help?

  • @grrr1351 6 years ago

    This is how FBI tracks people using bitcoin.

  • @ckay11002 6 years ago +7

    Do androids dream of electric sheep?

  • @IdgaradLyracant 6 years ago +6

    I did this stuff for nearly a decade, we called it behavioral heuristics in identifying people on the Internet. For example with VPNs, they are pointless, we want behaviors, not IP addresses. Staying anonymous on the Internet is nigh impossible now. Tor and VPNs aren't going to help at this point.

    • @tacticallala8788 6 years ago +4

      Are you saying you get the careful people too? The ones who as an example wouldn't add all the same YouTube channels.

  • @KipIngram 3 months ago

    Actually Enigma isn't THAT easy to break. Not "your average notebook could do it in seconds" easy. It's certainly doable with modern tech, but not really tech that every Joe on the street has under his arm.

  • @redhat7025 6 years ago +1

    NO software, IS UNBREAKABLE
    prison,
    government,
    or human

  • @ObsaSiyo 6 years ago

    Do you guys think the government will ever regulate processing power of computers? Like guns?

    • @tacticallala8788 6 years ago

      As long as they can still take your money for it they'll have lots of excuses ready and the C|/\ will be ready to anonymously mock you for disagrreeing, you fukken tinphoil hat wearing rossian psicho trying to krrack your way into govemment computers.

    • @ObsaSiyo 6 years ago

      Makes sense. So you are saying that as long as the government has access to everyone's computer they will not need to regulate them. However, what about AI in the future? If it lowers the learning curve for doing damage to the government, will they regulate what is on the computer instead of what the computer can do?

    • @iAmTheSquidThing 6 years ago

      I'm sure they'll try. Every organisation always pushes for more power.

    • @tacticallala8788 6 years ago

      They have already gone too far and there will always be someone crazy or greedy enough to take it to the next level.

  • @barefeg 6 years ago +19

    Meanwhile millennials throw their name, pictures, videos, locations, preferences, friend networks etc on the internet! Lol

    • @SuviTuuliAllan 6 years ago

      My location is the one that aliens are trying to avoid. My preferences include mustard and flavoured ice cream. Would you like to download my DNA as well? It's available on opensnp.org. No, really. See if I have the gene to care. (obviously I do, otherwise I wouldn't be wearing this silly hat right now! my neighbours seem to like it tho since they keep staring at it when I come out of the sauna...)

  • @shubhamshinde3593 6 years ago +21

    He looks like Bert from The Big Bang Theory

    • @maxf130 6 years ago +1

      Rock Show

  • @hihtitmamnan 6 years ago

    This guy talks SO LOUD and then quiet and then LOUD again... it's so annoying!

  • @aveaoz 6 years ago +1

    FIRST xdddddddddddd

    • @mdkmen 6 years ago +6

      please stop

    • @fetchstixRHD 6 years ago

      I think I’m starting to warm to “First” comments now 😂

    • @namewarvergeben 6 years ago

      If people "warm up to it", maybe that'll finally make it stop.

  • @igorbednarski8048 6 years ago +2

    While it is true that any cipher can be broken given enough time, at a certain level it is not 'a-supercomputer-would-need-a-few-years-level' difficult; it becomes 'the-sun-will-burn-out-even-if-you-had-a-planet-sized-quantum-computer-level' difficult.

    • @Yotanido 6 years ago

      Any cipher can be broken in an instant. You just need to guess right on your first guess.
      Insanely unlikely, yes, but still possible. There will never be an unbreakable cipher.
      The most secure cipher we have right now is the one time pad. The key length is equal to the message length, so there's no point in guessing the key - you can just guess the message. It's the best we'll ever have from a strictly information theory standpoint.
      Perhaps quantum cryptography will save us, but I don't know the first thing about it.
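
The one-time pad point can be shown in a few lines of Python. This is a sketch for illustration only, with made-up messages: the key is as long as the message, and for any other message of the same length there is some key that "decrypts" the same ciphertext to it, which is why guessing the key is no better than guessing the message.

```python
import secrets

def xor_bytes(a, b):
    """XOR two equal-length byte strings (both encryption and decryption)."""
    if len(a) != len(b):
        raise ValueError("one-time pad key must match the message length")
    return bytes(x ^ y for x, y in zip(a, b))

message = b"ATTACK AT DAWN"
key = secrets.token_bytes(len(message))   # random key, used only once
ciphertext = xor_bytes(message, key)

# The real key recovers the real message.
recovered = xor_bytes(ciphertext, key)

# A different key makes the same ciphertext read as a different message.
decoy = b"RETREAT AT TEN"
decoy_key = xor_bytes(ciphertext, decoy)
decoy_recovered = xor_bytes(ciphertext, decoy_key)
```

The guarantee depends on the key being truly random, kept secret, and never reused, which is the practical weakness other commenters point at.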