Thank you for watching my latest video. ♥ If you liked this video, I'm sure you're also going to love my Data Cleaning Master Class. Clean your data and find insights much faster at data-cleaning.albert-rapp.de/
Very nice tips and details on the use of glue and SQL query!!! Thanks for explaining them. They are a bit tricky. It could have been quite a while until I figured out where the error is. What a nice optimized approach you used for this part of the filtering. Great video Albert!
What a timely video. I'm working on a project using Duck DB from R. I know that Duck DB supports enums, nested data, and pivots, but I have no idea how to use them with DBI or dplyr. I'll put examples in the replies.
Example #1: Nested data. Count how many people have data across multiple years for each state (I'm American). My dplyr code looks like this: my_data |> distinct(person_id, state, year) |> summarize( ids = list(person_id), ids_in_this_year = n(), .by = c(state, year) ) |> mutate( ids_in_next_year = map2_int(ids, lead(ids), \(x, y) sum(x %in% y)), ids_in_last_year = map2_int(ids, lag(ids), \(x, y) sum(x %in% y)), .by = state )
Example 2: Convert numerical csv data into factors/enums. The data come in as 1, 2, or 99 and need to have the labels "Yes," "No," and "Unknown." Again, I have a dplyr example, but I'm not sure how to do this conversion with database enums. my_data |> mutate( x1 = factor(x1, levels = c(1, 2, 99), labels = c("Yes", "No", "Unknown") )
Example 3: Pivots. Trying to use dbplyr just tells me there's no method for pivoting even though I know Duck DB has pivots con |> tbl("my_data") |> select(state, year, ids_in_this_year) |> pivot_wider( names_from = state, values_from = ids_in_this_year )
@@rappa753 I have heard of duckplyr. My understanding is that it works with data in memory. My data are larger than memory, so they have to sit in a duckdb file.
In my first year of data science i went through pain of writing a parser, build with paste XD this would have been easier. Although some time later i found out about the dbplyr package, which literally converts tidyverse syntax into sql syntax. quite nice if the queries aren't too complex
Thank you for watching my latest video. ♥
If you liked this video, I'm sure you're also going to love my Data Cleaning Master Class. Clean your data and find insights much faster at data-cleaning.albert-rapp.de/
Very nice tips and details on the use of glue and SQL query!!! Thanks for explaining them. They are a bit tricky.
It could have been quite a while until I figured out where the error is. What a nice optimized approach you used for this part of the filtering.
Great video Albert!
Nice! I'm glad that you find this so useful 🤗🥳
What a timely video. I'm working on a project using Duck DB from R. I know that Duck DB supports enums, nested data, and pivots, but I have no idea how to use them with DBI or dplyr. I'll put examples in the replies.
Example #1: Nested data. Count how many people have data across multiple years for each state (I'm American). My dplyr code looks like this:
my_data |>
distinct(person_id, state, year) |>
summarize(
ids = list(person_id),
ids_in_this_year = n(),
.by = c(state, year)
) |>
mutate(
ids_in_next_year = map2_int(ids, lead(ids), \(x, y) sum(x %in% y)),
ids_in_last_year = map2_int(ids, lag(ids), \(x, y) sum(x %in% y)),
.by = state
)
Have you seen the {duckplyer} package yet? 🤔
Example 2: Convert numerical csv data into factors/enums. The data come in as 1, 2, or 99 and need to have the labels "Yes," "No," and "Unknown." Again, I have a dplyr example, but I'm not sure how to do this conversion with database enums.
my_data |>
mutate(
x1 = factor(x1, levels = c(1, 2, 99), labels = c("Yes", "No", "Unknown")
)
Example 3: Pivots. Trying to use dbplyr just tells me there's no method for pivoting even though I know Duck DB has pivots
con |>
tbl("my_data") |>
select(state, year, ids_in_this_year) |>
pivot_wider(
names_from = state,
values_from = ids_in_this_year
)
@@rappa753 I have heard of duckplyr. My understanding is that it works with data in memory. My data are larger than memory, so they have to sit in a duckdb file.
In my first year of data science i went through pain of writing a parser, build with paste XD this would have been easier. Although some time later i found out about the dbplyr package, which literally converts tidyverse syntax into sql syntax. quite nice if the queries aren't too complex
Yeah dbplyr is nice but it can still be useful to use a bit of SQL manually with glue_sql(). But kudos for building a parser 🥳