Text Mining con R - Don Quijote de la Mancha

Data Scientist Journal

9K views

Published: Dec 11, 2024
Science

Comments • 13

@florenciavasini974 6 months ago
🎯 Key Takeaways for quick navigation:
00:00 *📚 Introduction to text analysis of Don Quixote by Miguel de Cervantes*
- Introduces the analysis of the book "Don Quixote de la Mancha" using data analytics and software
- Mentions that the book has been downloaded in PDF format and will need cleaning
01:09 *📥 Loading the book text data*
- Loads the required libraries for reading PDFs, text analysis, and graphics
- Loads the two parts of the book's text using the pdftools package
03:25 *🧹 Cleaning the text corpus*
- Removes tabs, separators, and combines the two parts into a single text corpus
- Considers the period as the minimum analysis structure, representing the end of a sentence
04:50 *🔍 Preparing the text for analysis*
- Rectifies the text by replacing specific patterns with blanks
- Generates a vector of sentences, splitting the text at periods
06:47 *🧺 Additional text cleaning*
- Removes titles, page numbers, and other metadata from the text
- Applies stop words removal to clean the corpus further
08:41 *🗂 Creating a data frame for analysis*
- Assigns an id to each sentence in the data frame for analysis
- Keeps only distinct rows (sentences) to avoid duplicates
10:24 *💻 Tokenizing words*
- Tokenizes the sentences into individual words
- Removes stop words and filters out blank rows
11:36 *🔢 Word frequency analysis*
- Counts the frequency of each word in the text
- Identifies common words like "don", "quijote", "sancho"
12:18 *📝 Updating the stop words lexicon*
- Adds words like "capítulo" and whitespace to the stop words lexicon
- Applies the updated lexicon to further clean the text
13:51 *📊 Word cloud visualization*
- Creates a word cloud visualization of the most frequent words
- Highlights words like "don", "quijote", "sancho", "panza"
16:22 *🔍 Bigram analysis*
- Introduces bigram analysis to understand word contexts and relationships
- Tokenizes the text into bigrams (pairs of words)
18:39 *📈 Visualizing bigram frequencies*
- Counts the frequencies of bigrams in the text
- Visualizes the most frequent bigrams as a network graph
20:25 *🌐 Interpreting the bigram network*
- Explains how the network graph represents word relationships and contexts
- Highlights key characters, themes, and plot points identified through bigram analysis
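The steps listed above (sentence splitting, tokenization, stop-word removal, word counts, bigram counts) can be sketched roughly as follows. This is a minimal sketch, not the video's actual script: it assumes the dplyr and tidytext packages are installed, uses a short inline text instead of the PDF files, and the stop-word list `stop_es` is a tiny hypothetical one.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the cleaned book text (the video loads it from PDF files).
texto <- "En un lugar de la Mancha vivía un hidalgo. Sancho Panza era su escudero. Don Quijote habló con Sancho Panza."

# Split at periods, the minimum unit of analysis mentioned in the summary.
sentences <- trimws(unlist(strsplit(texto, ".", fixed = TRUE)))
sentences <- sentences[sentences != ""]

# One row per sentence with an id, keeping only distinct sentences.
df <- tibble(id = seq_along(sentences), sentence = sentences) %>%
  distinct(sentence, .keep_all = TRUE)

# Hypothetical, very small stop-word lexicon (an assumption for this sketch).
stop_es <- tibble(word = c("en", "un", "de", "la", "era", "su", "con"))

# Tokenize into words, drop stop words, count frequencies.
word_counts <- df %>%
  unnest_tokens(word, sentence) %>%
  anti_join(stop_es, by = "word") %>%
  count(word, sort = TRUE)

# Tokenize into bigrams (pairs of consecutive words) and count them.
bigram_counts <- df %>%
  unnest_tokens(bigram, sentence, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

print(word_counts)
print(bigram_counts)
```

Note that `unnest_tokens()` lowercases tokens by default, which is why names such as "sancho" appear in lower case in the counts.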
@GustavoDelaCruzTovar 1 month ago
---
# Text Mining with R - Analysis of Don Quijote
**Time Range**: 00:00:00 - 00:21:36
## Summary
- 📖 **Introduction to Text Analysis**: Describes the use of text mining tools in R to analyze classic literature, focusing on *Don Quijote*.
- 🧹 **Data Preparation**: Cleaning the text of *Don Quijote* for analysis, including removing unnecessary elements such as headers, footers, and non-essential characters.
- 🔍 **Tokenization and Stop Word Removal**: Splitting the text into sentences, phrases, and words, and filtering out common words.
- 📊 **Word Frequencies and Patterns**: Using word frequencies to identify prominent themes and character mentions (such as "Don Quijote", "Sancho").
- 🌐 **Bigram Analysis for Context**: Grouping words into pairs (bigrams) to understand relationships and common themes in the text.
## Number-Based Insights
- **452 vs. 524**: Split between the two parts of the *Don Quijote* text, crucial for organizing the analysis data.
- **7,874 vs. 17,500**: Initial word counts reduced to meaningful, unique words after removing duplicates.
- **Top Word Counts**: The most frequent words include "Don", "Sancho", "Quijote", reflecting the focus on the main character and his companion.
## Example Exploratory Questions
1. How does the analysis handle the long, repetitive text of *Don Quijote*?
2. What patterns emerge in the relationships between characters through bigram analysis?
3. How do visualizations such as word clouds improve understanding of the text's patterns?
@davidsolano6967 3 years ago ⁺¹
Easy and simple explanation... Excellent, my friend!
@kiddoquit 3 years ago
Very useful and easy to follow. Great work!
@eduardomartinezmendoza3658 2 years ago
Thanks for your contribution, excellent
@millerguev 3 years ago ⁺¹
Thanks for the tutorial, I learned a lot
@CristinaRestrepoArango 2 years ago
I get this error: Error in graph_from_data_frame(.) :
could not find function "graph_from_data_frame"
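`graph_from_data_frame()` is provided by the igraph package, so "could not find function" usually means igraph has not been loaded (or installed) in the current session. A minimal sketch, assuming igraph is installed and using a small hypothetical table of bigram counts:

```r
library(igraph)  # provides graph_from_data_frame()

# Hypothetical bigram counts, already separated into two word columns.
bigram_counts <- data.frame(
  word1 = c("don", "sancho", "señor"),
  word2 = c("quijote", "panza", "don"),
  n     = c(5, 4, 3)
)

# The first two columns are read as edges; remaining columns (here n)
# become edge attributes.
g <- graph_from_data_frame(bigram_counts)
print(g)
```

If the package is missing, `install.packages("igraph")` has to be run once before `library(igraph)`.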
@hiddencix 3 years ago
Hi, does anyone know how I can extract the whole paragraph that contains the "word" I'm searching for?
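One possible approach in base R is to split the text into paragraphs and keep those that match the word. This is only a sketch: it assumes paragraphs are separated by blank lines, which may not hold for text extracted from a PDF, and the names `texto` and `palabra` are illustrative.

```r
# Toy text; paragraphs are assumed to be separated by blank lines.
texto <- paste(
  "En un lugar de la Mancha vivía un hidalgo.",
  "",
  "Sancho Panza era su escudero.",
  "",
  "El hidalgo leía libros de caballerías.",
  sep = "\n"
)

# Split on one or more blank lines.
paragraphs <- unlist(strsplit(texto, "\n\\s*\n", perl = TRUE))

# Keep the paragraphs that contain the search word as a whole word.
palabra <- "hidalgo"
pattern <- paste0("\\b", palabra, "\\b")
hits <- paragraphs[grepl(pattern, paragraphs, ignore.case = TRUE, perl = TRUE)]

print(hits)
```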
@leoferreras1017 3 years ago
To do this with another book, do we only need to change the name? Or do we have to use different code?
@rubenurbizagastegui36 3 years ago
My friend, the last command, which produces the word network, gives the following error: Error: Problem with `filter()` input `..1`.
x comparison (5) is possible only for atomic and list types
i Input `..1` is `n >= 3`.
Run `rlang::last_error()` to see where the error occurred.
Could you please tell me how to fix this error? Thank you very much if you can.
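One likely cause of this message is that the data frame reaching `filter()` has no column named `n`, so `n` falls back to the dplyr function `n()`, and a function cannot be compared with a number. That happens, for example, when the `count()` step was skipped or its output was not saved. A minimal sketch under that assumption, with hypothetical bigram data:

```r
library(dplyr)

# Hypothetical bigrams, already separated into two columns.
bigrams <- tibble(
  word1 = c("don", "don", "don", "sancho", "sancho"),
  word2 = c("quijote", "quijote", "quijote", "panza", "panza")
)

# count() is what creates the column n; filter(n >= 3) needs it to exist.
bigram_counts <- bigrams %>%
  count(word1, word2, sort = TRUE)

# Quick check before filtering: is there really a column called n?
stopifnot("n" %in% names(bigram_counts))

frequent <- bigram_counts %>%
  filter(n >= 3)

print(frequent)
```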
@elenamorandini7020 3 years ago
Hi, I'm trying to replicate the example with my own text in Italian, but when I reach the tokenization step R gives me the following error and I don't understand where I went wrong:
> unnest_tokens(word, sentence, drop = FALSE) #tokenise
Error in UseMethod("pull") :
no applicable method for 'pull' applied to an object of class "function"
Can anyone help me?
Thanks
@eltito1453 2 years ago
Did you find a solution? The same thing happens to me, but in English
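A plausible explanation, judging from the console line above, is that `unnest_tokens(word, sentence, drop = FALSE)` was run on its own, without a data frame piped into it. In that case the first argument `word` is taken as the table, and if stringr (or the tidyverse) is loaded, `word` is a function, which would match the "object of class function" message. A minimal sketch of the piped form, assuming dplyr and tidytext are installed and using a hypothetical two-sentence data frame:

```r
library(dplyr)
library(tidytext)

# Hypothetical data frame with one sentence per row.
df <- tibble(
  id = 1:2,
  sentence = c("Nel mezzo del cammin", "di nostra vita")
)

# The data frame must come first, here via the pipe; running the
# unnest_tokens() line alone drops that first argument.
words <- df %>%
  unnest_tokens(word, sentence, drop = FALSE)

print(words)
```

When running the script line by line, selecting the whole pipeline (from the data frame name through the last `%>%` step) before executing avoids this.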
