1 + 1 = 1 or Record Deduplication with Python | Flávio Juvenal @ PyBay2018

SF Python

10K views

Published: Sep 11, 2024

Comments • 21

@azulzao 5 years ago ⁺⁷
Really nice presentation. Thanks for sharing.
I wish I had watched this video before. I didn't know half of the libraries you used and they can help me a lot on the project I'm currently working on.
@tuzyamage915 1 month ago
Amazing presentation, hats off to the hard work. I got a lot of understanding from it.
@dbiswas 3 years ago ⁺¹
This is the best Python de-duplication explanation so far. I am working on a personal project to build a product, and this is mind-opening. Thank you so much. Big kudos to the speaker. Is there a way I can connect with you? Maybe I can ask him to fly to Seattle and work on a project with a team independently?
@manguebeatle 5 years ago ⁺⁴
The runnable slides are available at github.com/vintasoftware/deduplication-slides
@DeWitteWilson 10 months ago
Brilliant!
@caotrananh 3 years ago ⁺¹
Hi, at 24:51, I do not understand how to compute the weighted average. In your tutorial, you set the weights for the columns to 30, 10, 5, 10, 30, 15 respectively. Could you tell me why you chose those values, or where I can find material that explains how to compute values like these? Thank you so much.
@manguebeatle 3 years ago ⁺¹
Hi, you can set whatever values work best for you. But it's probably better to train an ML classifier instead of doing that.
@caotrananh 3 years ago
@@manguebeatle Thank you so much
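A minimal sketch of the weighted average discussed in this thread, assuming each column comparison yields a similarity between 0 and 1. Only the six weights come from the question; the similarity values, the `weighted_score` helper, and the 0.75 threshold are hypothetical and are not taken from the talk.

```python
def weighted_score(similarities, weights):
    """Weighted average of per-column similarity scores (each in [0, 1])."""
    if len(similarities) != len(weights):
        raise ValueError("need exactly one weight per similarity score")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(s * w for s, w in zip(similarities, weights)) / total


# Weights quoted in the question above; they happen to sum to 100.
WEIGHTS = [30, 10, 5, 10, 30, 15]

# Hypothetical similarities for one candidate pair, one value per column.
pair_similarities = [0.9, 0.5, 1.0, 1.0, 0.8, 0.6]

score = weighted_score(pair_similarities, WEIGHTS)

# The cut-off is an assumption; in practice it is tuned on labeled pairs.
is_duplicate = score >= 0.75
```

Hand-picked weights like these encode a guess about which columns matter most, which is why the reply suggests letting a classifier learn them from labeled pairs instead.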
@caotrananh 3 years ago ⁺¹
I got to the step that contains the following:
if not isinstance(deduper, dedupe.StaticDedupe):
    deduper.sample(data_for_dedupe)

training_filename = 'dedupe-simple-training.json'
if os.path.exists(training_filename):
    with open(training_filename) as tf:
        deduper.readTraining(tf)
----------
but I get this error:
AttributeError Traceback (most recent call last)
in
1 if not isinstance(deduper, dedupe.StaticDedupe):
----> 2 deduper.sample(data_for_dedupe)
3
4 training_filename = 'dedupe-simple-training.json'
5 if os.path.exists(training_filename):
AttributeError: 'SVMDedupe' object has no attribute 'sample'
---
Could you please tell me how to fix it? Thank you so much.
@caotrananh 3 years ago
These are the package versions I used:
package version
----
dedupe 2.0.6
dedupe-hcluster 0.3.8
dedupe-variable-datetime 0.1.5
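The question above has no reply in the thread. One likely cause, offered as an assumption rather than a confirmed diagnosis: the dedupe 2.x releases renamed the training methods to snake_case, and `sample` plus `readTraining` appear to have been folded into a single `prepare_training(data, training_file=...)` call, while the slides were written against an older release. The sketch below shows that adaptation; the `prepare_deduper` helper is hypothetical, and it checks for the method by name instead of importing the library.

```python
import os


def prepare_deduper(deduper, data, training_filename):
    """Hypothetical helper: prepare training using the dedupe 2.x method name.

    Assumes `deduper.prepare_training(data, training_file=None)` exists on
    trainable dedupers, as in dedupe 2.x. Objects without that method
    (for example a static deduper loaded from settings) are left alone.
    """
    if not hasattr(deduper, "prepare_training"):
        return False

    if os.path.exists(training_filename):
        # Reuse previously labeled pairs while sampling new candidates.
        with open(training_filename) as tf:
            deduper.prepare_training(data, training_file=tf)
    else:
        deduper.prepare_training(data)
    return True
```

Pinning the dedupe version used by the slides repository is the other option, if matching the talk exactly matters more than running on a current release.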
@lechimar1 5 years ago ⁺¹
Very good, Flávio. Congratulations. I am writing an article on the subject. Can you clear something up for me: when you do the classification, you don't need that comparison step, right? Thanks.
@lechimar1 5 years ago
Another thing: how do I recover the data that was encoded and preprocessed?
@manguebeatle 4 years ago ⁺¹
Hi Michel, yes, you do need it. The comparison gives you the vectors that you pass to the duplicate/non-duplicate classification.
@manguebeatle 4 years ago ⁺¹
@@lechimar1 Keep ID mappings. You will need to do that manually.
@lechimar1 4 years ago ⁺¹
@@manguebeatle Thanks. Cheers.
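A minimal sketch of the "keep ID mappings" advice from this thread, assuming records live in dictionaries keyed by a record ID. The `normalize` function, the sample rows, and the duplicate pair are hypothetical; the point is only that preprocessing writes to a separate structure, so the originals can be looked up by ID afterwards.

```python
import unicodedata


def normalize(text):
    """Illustrative preprocessing: strip accents, lowercase, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


# Original records, keyed by ID and never modified.
original_rows = {
    101: {"name": "  Café  São João "},
    102: {"name": "Cafe Sao Joao"},
}

# Preprocessed copies share the same IDs, which is the mapping to keep.
processed_rows = {
    record_id: {"name": normalize(row["name"])}
    for record_id, row in original_rows.items()
}

# Hypothetical classifier output: a pair of IDs flagged as duplicates.
duplicate_pair = (101, 102)

# Recover the untouched originals through the shared IDs.
recovered = [original_rows[record_id] for record_id in duplicate_pair]
```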
@FirstNameLastName-fv4eu 4 years ago ⁺³
After you die, you will only go to heaven!!! because of blessings from people like me. :) Thanks.
@gabrielsartori4222 5 years ago ⁺¹
Slides and code?
@manguebeatle 5 years ago
Here: github.com/vintasoftware/deduplication-slides
@aGianOstaLgia 1 year ago
7:17, some companies use suffixes as part of their names. What a tradeoff.
@manguebeatle 1 year ago
Hi! If you care about suffixes, you can separate them in another column and your ML classifier can attribute a high weight to them. It all depends on your use case. Nevertheless, this was just a simple example.
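A minimal sketch of the suggestion above to move suffixes into their own column, assuming company names are plain strings. The suffix list and the `split_suffix` helper are hypothetical and would need adapting to the dataset at hand.

```python
import re

# Hypothetical suffix list; extend it for the data being deduplicated.
SUFFIXES = ("inc", "llc", "ltd", "corp", "co")

# A suffix counts only at the end of the name, after whitespace or a comma,
# optionally followed by a period.
_SUFFIX_RE = re.compile(
    r"[\s,]+(" + "|".join(SUFFIXES) + r")\.?$", re.IGNORECASE
)


def split_suffix(name):
    """Return (base_name, suffix); suffix is '' when none is found."""
    name = name.strip()
    match = _SUFFIX_RE.search(name)
    if match is None:
        return name, ""
    return name[: match.start()].strip(), match.group(1).lower()


# Build two columns from one, so each can be compared and weighted separately.
rows = ["Acme, Inc.", "Acme LLC", "Vinta Software"]
columns = [dict(zip(("name", "suffix"), split_suffix(r))) for r in rows]
```

With the suffix in its own column, "Acme, Inc." and "Acme LLC" still match on the name while differing on the suffix, and the classifier can decide how much that difference matters.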
