What do you think about Datastream for your data capture and replication needs? Let us know in the comments below and subscribe for more Google Cloud tips and tricks → goo.gle/GoogleCloudTech
Is there any solution for partitioning data with Datastream?
I managed to play around with Datastream with BigQuery as the destination. The problem with this approach is that the tables created are not partitioned. Those of us who do incremental loads from our BigQuery replica to our reports will always have to scan the whole table, which comes with a cost, compared to scanning and querying only the new data in the BigQuery replica.
Perhaps CLI or programmatic access (not shown in the UI) will allow specifying a partition key.
@@terminalrecluse I checked the CLI as well, but I couldn’t find it there. Is there any solution for partitioning data? @googlecloudtech
Currently partitions aren't leveraged for optimizing the size of the table scan. We are looking into implementing this as a future improvement.
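In the meantime, a possible workaround (a sketch only — the dataset, table, and timestamp column names below are placeholders, not from the thread) is to maintain a partitioned copy of the Datastream-created table for downstream queries:

```sql
-- Sketch: build a partitioned copy of the replica table so incremental
-- jobs can prune partitions instead of scanning the whole table.
-- mydataset, mytable, and update_time are placeholder names.
CREATE TABLE mydataset.mytable_partitioned
PARTITION BY DATE(update_time)
AS
SELECT * FROM mydataset.mytable;
```

Note that Datastream keeps writing to the original table, so the copy would need to be refreshed periodically (or the partitioned table pre-created before the stream first writes to it).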
If it works as described, it is really COOOOOOL. Thanks a lot. @Gabe Weiss, some questions: 1. Are there any limitations for Datastream for BigQuery? 2. I am using Cloud SQL, so it would be great to have a tutorial for this combination. 3. It looks like an AlloyDB competitor, doesn't it? What are the core differences? (I am considering AlloyDB for the new version of our project, to avoid streaming analytics data to BigQuery.)
Hey! At 2:31, do you have an example of how you created or set up that connection? Thanks.
Can I stream a subset of columns from my source? The CLI help (gcloud datastream streams create --help) suggests yes, but when I specify mysql_columns in the suggested format, gcloud errors out with: ERROR: gcloud crashed (ValidationError): Expected type for field column, found {} (type )
Yes, a subset of columns is supported, in both the UI and API. There was a bug in gcloud around this capability, it should be fixed now.
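For reference, a sketch of the column-selection structure the Datastream API expects for a MySQL source (field names follow the API's MysqlRdbms message; the database, table, and column names below are placeholders):

```yaml
# Sketch: include only two columns of mydb.mytable in the stream.
# All names here are placeholders.
mysqlDatabases:
- database: mydb
  mysqlTables:
  - table: mytable
    mysqlColumns:
    - column: id
    - column: updated_at
```

This would be passed to gcloud as the include-objects definition; the exact flag and file layout may vary by gcloud version.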
What if records are updated/deleted in the source system (MySQL)? Does it also perform the update/delete in BigQuery, or does it work in append-only mode?
Datastream replicates all UPDATE / INSERT / DELETE operations to the destination. Support for append-only mode is planned for a future release.
Is there a way to let a Cloud SQL IAM user (or a service account user) be accepted as a way to connect to the Cloud SQL db?
Not currently, sadly, no. BUT, it's something we're thinking about. No promises on a timeline, but it's definitely something we're working on adding.
Through the GUI, when selecting the source objects to replicate, I can use wildcards such as "*.mytable". How do I do this with the CLI? When I describe a stream created through the GUI (gcloud datastream streams describe), the database field is simply missing, but when I try to create a new stream using the same format, gcloud bombs out with "ERROR: gcloud crashed (ParseError): Cannot parse YAML: missing key "database"."
Yes, this is supported - you need to specify an empty database key ( "database": "" )
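For example, a minimal include-objects sketch matching "*.mytable" (placeholder names; the surrounding file layout may vary by gcloud version):

```yaml
# Sketch: an empty database key acts as the "*" wildcard,
# matching mytable in every database.
mysqlDatabases:
- database: ""
  mysqlTables:
  - table: mytable
```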
What happens if I accidentally delete the destination table in BigQuery? How can I restore the table?
Datastream will recreate the table automatically, and the data that was deleted can be recovered by triggering a backfill from the source.
Hi @etaimargolin8449, I found that some rows were duplicated in the destination table in BigQuery, and I cannot delete any rows from the table. How can I solve this?
When will Postgres Cloud SQL Datastream be available?
It's also available now 🎉
cloud.google.com/datastream/docs/sources-postgresql
Wow, great news!!!
Can MariaDB be used instead of MySQL as a source to stream data to BigQuery?
Yes it can! See here for the supported versions of MySQL: cloud.google.com/datastream/docs/faq#behavior-and-limitations
It's already in preview for Postgres??? 😮😮
Does it support Customer managed encryption keys ?
Datastream supports CMEK for data stored at rest. Support for CMEK on data loaded to BigQuery will be added shortly.
Will it accept schema changes?
Yes, many schema changes are automatically detected and supported, but some changes might not be detected automatically and may result in events from that table being dropped. In this case, Datastream will report the reason for the event(s) being dropped, and any missing data can be recovered using a backfill of the table.
Can this feature be used to load from Bigquery to CloudSQL(Postgres) and have realtime streaming for operational purposes.
@googlecloudtech
Why do you need it?
@@vitamin2732 There is already an established process where outputs are stored in BQ; now we need to send outputs to Cloud SQL for API consumption. We need both outputs: the one stored in BQ for analytical reporting, and the other for real-time usage through an API.
Hope it makes sense.
So I'm wondering how to get real-time streaming from BQ to Cloud SQL tables, with an automatic CDC feature.
@@rameshyadav1723 It looks like the wrong architecture... normally you stream from Cloud SQL to BQ.
@@rameshyadav1723 I think Datastream's latency is too high for it to be used for a real-time transaction API.