1 + 1 = 1 or Record Deduplication with Python | Flávio Juvenal @ PyBay2018

  • Published: Sep 11, 2024

Comments • 21

  • @azulzao
    @azulzao 5 years ago +7

    Really nice presentation. Thanks for sharing.
    I wish I had watched this video before. I didn't know half of the libraries you used and they can help me a lot on the project I'm currently working on.

  • @tuzyamage915
    @tuzyamage915 1 month ago

    Amazing presentation, hats off to the hard work. I gained a lot of understanding from it.

  • @dbiswas
    @dbiswas 3 years ago +1

    This is the best explanation of de-duplication in Python so far. I am working on a personal project to build a product, and this is mind-opening. Thank you so much. Big kudos to the speaker. Is there a way I can connect with you? Maybe I can ask him to fly to Seattle and work on a project with a team independently?

  • @manguebeatle
    @manguebeatle 5 years ago +4

    The runnable slides are available at github.com/vintasoftware/deduplication-slides

  • @DeWitteWilson
    @DeWitteWilson 10 months ago

    Brilliant!

  • @caotrananh
    @caotrananh 3 years ago +1

    Hi, at 24:51, I do not understand how to compute the weighted average. In your tutorial, you set the weights for the columns to 30, 10, 5, 10, 30, 15 respectively. Could you tell me why you chose those values, or where I can find material that explains how to compute values like these? Thank you so much.

    • @manguebeatle
      @manguebeatle 3 years ago +1

      Hi, you can set whatever values work best for you. But it's probably better to train an ML classifier instead of doing that.

    • @caotrananh
      @caotrananh 3 years ago

      @manguebeatle Thank you so much
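
For readers with the same question, here is a minimal sketch of how a weighted average over per-column similarity scores could be computed with the weights quoted in the question (30, 10, 5, 10, 30, 15). The function name `weighted_score` and the example similarity values are assumptions made for illustration; they are not taken from the talk or its slides.

```python
def weighted_score(similarities, weights):
    """Weighted average of per-column similarity scores (each in [0, 1])."""
    if len(similarities) != len(weights):
        raise ValueError("similarities and weights must have the same length")
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(similarities, weights)) / total_weight


# Weights quoted in the question; they happen to sum to 100.
weights = [30, 10, 5, 10, 30, 15]

# Hypothetical similarity vector for one candidate pair of records.
pair_similarities = [1.0, 0.5, 0.0, 1.0, 1.0, 0.0]

score = weighted_score(pair_similarities, weights)
print(score)
```

A pair would then be called a duplicate when `score` exceeds a threshold you pick by hand, which is why a trained classifier is often the more robust choice.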

  • @caotrananh
    @caotrananh 3 years ago +1

    I reached the step with the following code:
    if not isinstance(deduper, dedupe.StaticDedupe):
        deduper.sample(data_for_dedupe)

    training_filename = 'dedupe-simple-training.json'
    if os.path.exists(training_filename):
        with open(training_filename) as tf:
            deduper.readTraining(tf)
    ----------
    but I get this error:
    AttributeError Traceback (most recent call last)
    in
    1 if not isinstance(deduper, dedupe.StaticDedupe):
    ----> 2 deduper.sample(data_for_dedupe)
    3
    4 training_filename = 'dedupe-simple-training.json'
    5 if os.path.exists(training_filename):
    AttributeError: 'SVMDedupe' object has no attribute 'sample'
    ---
    Could you please tell me how to fix it? Thank you so much.

    • @caotrananh
      @caotrananh 3 years ago

      I am using the following package versions:
      package                    version
      -------------------------  -------
      dedupe                     2.0.6
      dedupe-hcluster            0.3.8
      dedupe-variable-datetime   0.1.5
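
A likely cause of the error above: `sample` and `readTraining` belong to the dedupe 1.x API, while in dedupe 2.x a single `prepare_training(data, training_file=...)` call covers both steps. The sketch below is one way to support either version. The helper name `prepare_deduper_training` is hypothetical, and it assumes `SVMDedupe` (which appears to be a class defined in the slides) inherits these methods from `dedupe.Dedupe`.

```python
import os


def prepare_deduper_training(deduper, data, training_filename):
    """Sample training pairs using whichever dedupe API the object exposes.

    Returns "2.x" or "1.x" to show which branch ran.
    """
    if hasattr(deduper, "prepare_training"):
        # dedupe 2.x: sampling and loading saved labels happen in one call.
        if os.path.exists(training_filename):
            with open(training_filename) as tf:
                deduper.prepare_training(data, training_file=tf)
        else:
            deduper.prepare_training(data)
        return "2.x"

    # dedupe 1.x: sample first, then read saved labels if the file exists.
    deduper.sample(data)
    if os.path.exists(training_filename):
        with open(training_filename) as tf:
            deduper.readTraining(tf)
    return "1.x"
```

With the code from the comment, the call would be `prepare_deduper_training(deduper, data_for_dedupe, 'dedupe-simple-training.json')`, still guarded by the `isinstance(deduper, dedupe.StaticDedupe)` check. Other 1.x method names used later in the notebook may need similar updates.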

  • @lechimar1
    @lechimar1 5 years ago +1

    Very good, Flávio. Congratulations. I am writing an article on the subject. Could you clear something up for me: when you do the classification, you don't need that comparison step, right? Thanks.

    • @lechimar1
      @lechimar1 5 years ago

      Another thing: how do I recover the data that was encoded and preprocessed?

    • @manguebeatle
      @manguebeatle 4 years ago +1

      Hi Michel, yes, you do need it. The comparison gives you the vectors that you pass to the duplicate/non-duplicate classification.

    • @manguebeatle
      @manguebeatle 4 years ago +1

      @lechimar1 Keep ID mappings. You will need to do that manually.

    • @lechimar1
      @lechimar1 4 years ago +1

      @manguebeatle Thank you. Best regards.
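
The advice to keep ID mappings can be sketched roughly as follows: keep the original records keyed by ID, preprocess copies in a known order, and translate matched positions back to the originals. The `normalize` helper, the sample records, and the exact-match comparison are all assumptions made for illustration; a real pipeline would compare feature vectors and classify them, as described in the reply above.

```python
import unicodedata


def normalize(text):
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


# Hypothetical original records, keyed by their own IDs.
originals = {
    "a1": "  Café São João ",
    "b7": "cafe sao joao",
    "c3": "Padaria Central",
}

# The mapping: position in the preprocessed list -> original record ID.
record_ids = list(originals)
processed = [normalize(originals[rid]) for rid in record_ids]

# Stand-in for the comparison and classification steps: exact match only.
duplicate_pairs = [
    (i, j)
    for i in range(len(processed))
    for j in range(i + 1, len(processed))
    if processed[i] == processed[j]
]

# Use the mapping to recover the untouched original data.
recovered = [
    (originals[record_ids[i]], originals[record_ids[j]])
    for i, j in duplicate_pairs
]
print(recovered)
```

Because the originals are never overwritten, nothing has to be decoded; the positions reported by the matching step are enough to look the raw records up again.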

  • @FirstNameLastName-fv4eu
    @FirstNameLastName-fv4eu 4 years ago +3

    After you die, you will only go to heaven!!! because of blessings from people like me. :) Thanks.

  • @gabrielsartori4222
    @gabrielsartori4222 5 years ago +1

    Slides and code?

    • @manguebeatle
      @manguebeatle 5 years ago

      Here: github.com/vintasoftware/deduplication-slides

  • @aGianOstaLgia
    @aGianOstaLgia 1 year ago

    At 7:17: some companies use suffixes as part of their names. What a tradeoff.

    • @manguebeatle
      @manguebeatle 1 year ago

      Hi! If you care about suffixes, you can separate them in another column and your ML classifier can attribute a high weight to them. It all depends on your use case. Nevertheless, this was just a simple example.
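
One possible way to separate suffixes into their own column, as suggested in the reply above, is a small regular-expression split. The suffix list and the `split_suffix` helper below are assumptions for illustration and are not taken from the talk; a real dataset would need a list tuned to its own company names.

```python
import re

# Hypothetical, deliberately short list of legal suffixes.
SUFFIXES = ("inc", "llc", "ltd", "corp", "co")
SUFFIX_RE = re.compile(
    r"[\s,]+(" + "|".join(SUFFIXES) + r")\.?$",
    re.IGNORECASE,
)


def split_suffix(name):
    """Return (base_name, suffix); suffix is '' when none is found."""
    name = name.strip()
    match = SUFFIX_RE.search(name)
    if match is None:
        return name, ""
    return name[:match.start()], match.group(1).lower()


for company in ["Acme Foods, Inc.", "Alpha Corp", "Costco"]:
    print(split_suffix(company))
```

The base name and the suffix can then be compared as two separate features, so the classifier decides how much a suffix mismatch should count.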