PracticalGCP
  • 63 videos
  • 108,455 views
Save 50 percent of your Data Engineering effort via Continuous Queries
Can #BigQuery Continuous Queries save your organisation 50% of data engineering effort?
I know I'm a bit late to the conversation, but I wanted to test this feature thoroughly to provide a well-rounded review. I dive deep into which use cases benefit the most, where the real time savings come from, and, most importantly, whether it is ready for production environments. I also cover some of the key challenges I encountered. Hope it's worth the wait!
A big shoutout to Nick Orlove for his incredible support and passion in driving the Continuous Queries feature in BigQuery. He’s been instrumental in gathering feedback, and even authored a fantastic article that’s...
1,296 views

Videos

Stream sharing with Pub/Sub using Analytics Hub
422 views · 1 month ago
In this video, I focus on the challenges I've faced and explain why this simple addition to Analytics Hub can be an extremely effective method for sharing streaming data. This is particularly beneficial in large organisations where multiple topics are published across various projects, and numerous subscribers belong to different teams. - 01:28 Quick intro to Analytics Hub - 02:19 Quick intro t...
Seamless transition of Vector Search from BigQuery to Feature Store
412 views · 1 month ago
Last week I found a Notebook created by Elia (Google) and Lorenzo (Google) that greatly simplifies transitioning Vector Search from BigQuery to Feature Store, enabling a smooth shift from offline to online serving with minimal code changes. Considering my last RAG video didn't cover online serving in depth, I think this is a perfect topic for a follow-up video. I'll demonstrate how easy it is t...
Run Cloud Composer Locally
577 views · 2 months ago
Google Cloud has introduced a command-line interface (CLI) for running an Airflow environment with Cloud Composer. This tool offers arguably the most convenient method for operating Airflow in a Composer-like setting for local development purposes. The significance of this tool lies in its comprehensive features for local development and its ability to easily incorporate additional Python packa...
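A rough sketch of how that local workflow can look with the open-source composer-local-dev CLI (the environment name, image version and DAGs path below are assumptions, not taken from the video):

    pip install git+https://github.com/GoogleCloudPlatform/composer-local-dev.git
    # Create a local environment that mirrors a given Composer image version.
    composer-dev create my-local-env \
        --from-image-version composer-2.5.0-airflow-2.5.3 \
        --dags-path ./dags
    # Start it; the Airflow UI becomes available on localhost.
    composer-dev start my-local-env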
DBT Core on Cloud Run Job
1.8K views · 4 months ago
I'm thrilled to share my latest video: "DBT Core on Cloud Run Job" This tutorial is tailored for those looking to streamline their data transformation workflows in a serverless environment using Google Cloud's powerful service, Cloud Run. Whether you're a seasoned professional or just starting, this video has valuable insights for everyone. 🔹 What's Inside? - A step-by-step guide on setting up ...
How to build a sustainable data ecosystem on Google Cloud
640 views · 5 months ago
Today, I’m sharing my experience on how to establish a data ecosystem within Google Cloud that addresses some significant challenges such as enhancing the speed of development, improving data management and sharing, ensuring quality, and most importantly, identifying clear methods to evaluate our progress. It’s essential to recognise that creating a lasting data system isn’t merely about adheri...
A practical application leveraging Langchain and BigQuery Vector Search
2.3K views · 6 months ago
Today, I'm thrilled to share insights into the integration of Langchain and BigQuery Vector search through a practical application that I've developed. This video presentation goes beyond theoretical discussion, offering a hands-on look at leveraging these cutting-edge technologies. I cover critical topics like Large Language Models (LLM), Embeddings, and Vector Search, which are fundamental to...
Scaling development teams with Cloud Workstations
622 views · 7 months ago
In today's rapidly evolving digital landscape, the need for flexible and secure development environments has never been greater. Enter Cloud Workstation, a game-changing solution that empowers development teams to harness the full potential of cloud computing, specifically within the Google Cloud Platform (GCP). In this video, we'll delve deep into the world of Cloud Workstation, exploring how ...
Privileged Just-in-time access on Google Cloud with JIT
933 views · 7 months ago
Just-In-Time privileged access is a method for managing access to Google Cloud projects in a more secure and efficient manner. It's an approach that aligns with the principle of least privilege, granting users only the access they need to perform specific tasks and only when they need it. This method helps reduce risks, such as accidental modifications or deletions of resources, and creates an ...
Real-time Analytics with Cloud Spanner CDC
463 views · 8 months ago
In the realm of relational databases, Cloud Spanner stands as a remarkable force, offering unparalleled horizontal scalability, reaching near-infinite capacity. Its 99.999% availability SLA across regions makes it a formidable contender for even the most demanding transactional workloads. Cloud Spanner's native support for CDC (change data capturing) through its "Change Stream" feature empowers...
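As a hedged illustration of the Change Stream DDL involved (the table and stream names are hypothetical):

    -- Capture changes for a single table, retained for 7 days.
    CREATE CHANGE STREAM OrdersStream
      FOR Orders
      OPTIONS (retention_period = '7d');

    -- Or capture every table in the database.
    CREATE CHANGE STREAM EverythingStream FOR ALL;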
Streaming data from BigQuery to Datastore using Dataflow
1.1K views · 9 months ago
🚀 Today, I'm eager to discuss a method for using a Dataflow streaming pipeline to move data from BigQuery to Cloud Datastore. At first glance, this approach might seem unconventional. Why? Because BigQuery isn't typically associated with streaming capabilities. However, I believe this strategy has immense potential. 🔍 Here's the context: a significant portion of our data now resides in BigQuer...
Serverless distributed processing with BigFrames
2.3K views · 10 months ago
Exciting news from Google Cloud with the launch of BigFrames (in preview). 🚀🚀🚀 This new library has significant potential to streamline processes that were traditionally managed by more intricate technologies like Apache Beam (Dataflow) or Spark. It also fills the gap between local Pandas operations running on Jupyter and deploying large-scale workloads in production, and enables faster interac...
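A minimal, hedged sketch of the BigFrames programming model (the project id and table are placeholders; the point is that the pandas-like calls are pushed down to BigQuery rather than executed locally):

    import bigframes.pandas as bpd

    bpd.options.bigquery.project = "my-project"   # assumed project id
    bpd.options.bigquery.location = "US"

    # Looks like pandas, but the work happens in BigQuery.
    df = bpd.read_gbq("bigquery-public-data.samples.natality")
    summary = (
        df[df["year"] >= 2000]
        .groupby("year")["weight_pounds"]
        .mean()
    )
    print(summary.to_pandas())  # only the small aggregated result is pulled locally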
Automated data profiling and quality scan via Dataplex
7K views · 11 months ago
Data quality is a critical concern within a complex data environment, particularly when dealing with a substantial volume of data distributed across multiple locations. To systematically identify and visualise potential issues, establish periodic scans, and notify the relevant teams at an organisational level on a significant scale, where should one begin? This is precisely where the automated ...
Centralised Data Sharing using Analytics Hub
2.9K views · 1 year ago
Sharing data in a medium to large organisation has always been a big challenge. In today's talk I describe some of the data sharing challenges I've seen over the past years in different organisations, and how the new Google Cloud product Analytics Hub can potentially solve this in a much easier and more user-friendly way for the analytics community. 01:50 - Data Sharing challenges 04:59 - What i...
BigQuery to Datastore via Remote Functions
1.6K views · 1 year ago
BigQuery Remote Functions have revolutionised the way we design data pipelines, eliminating the need for additional overhead between teams. Thanks to the exceptional efforts of the Unytics.io team for creating Bigfunctions (unytics.io/bigfunctions/), a remarkable library collection bundled with Cloud Run deployment for BigQuery Remote Functions. With just a simple SQL query, we can seamlessly p...
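To give a flavour of what "just a simple SQL query" means here, a hedged sketch of a remote function backed by a Cloud Run endpoint (connection, dataset, endpoint and table names are all hypothetical):

    CREATE OR REPLACE FUNCTION `my_project.my_dataset.push_to_datastore`(payload STRING)
    RETURNS STRING
    REMOTE WITH CONNECTION `my_project.eu.my-cloud-run-connection`
    OPTIONS (
      endpoint = 'https://my-remote-function-xyz-ew.a.run.app'  -- Cloud Run service URL
    );

    -- Push rows to the service straight from SQL.
    SELECT my_project.my_dataset.push_to_datastore(TO_JSON_STRING(t))
    FROM `my_project.my_dataset.customers` AS t;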
Connect to services on another VPC via Private Service Connect (PSC)
8K views · 1 year ago
Cloud PubSub Multi-Team Design
970 views · 1 year ago
Snapshotting Data using PubSub and Datastore for Efficient API Serving
612 views · 1 year ago
Cloud Run with IAP
6K views · 1 year ago
Run Apache Spark jobs on serverless Dataproc
4.3K views · 1 year ago
GPG GCS File Decryptor on Cloud Run
1.1K views · 1 year ago
Is BigQuery Slot Autoscaling any good?
2.5K views · 1 year ago
Cloud Run PubSub Consumer via Pull Subscription
5K views · 1 year ago
Data-Aware Scheduling in Airflow 2.4
1.1K views · 1 year ago
Improve Organisation Observability with Cloud Asset Inventory
772 views · 1 year ago
Near real-time CDC using DataStream
6K views · 1 year ago
The 2022 Wrap-up
329 views · 1 year ago
Run DBT jobs with Cloud Batch
1.2K views · 1 year ago
Super Lightweight Real-time Ingestion Design
1.1K views · 1 year ago
Cloud Datastore TTL
1K views · 1 year ago

Comments

  • @listenpramod · 2 days ago

    Thanks Richard, very well explained. I would say we need links to your videos in the GCP docs on the respective topics. If I want to push events from on-prem to a topic in GCP and then use Analytics Hub to share the stream with downstream GCP projects, where do you think the topics that receive the events from on-prem should be created?

    • @practicalgcp2780 · 56 minutes ago

      No worries, have a look at this blog published by Google on the topic; it has all the links you need: cloud.google.com/blog/products/data-analytics/share-pubsub-topics-in-analytics-hub For sending events from on-prem, I suggest creating a new project dedicated to ingestion and putting the topic there. It's better this way because it is very unlikely the consumers of the topic will be just one application, and this keeps things cleaner when you have multiple consumers from different applications / squads.

  • @fbnz742 · 2 days ago

    Hi Richard, thank you so much for sharing this. This is exactly what I wanted. I have a few questions: 1. Do you have any example on how to orchestrate it using Composer? I mean the DAG code. 2. I am quite new to DBT. I used DBT Cloud before and I could run everything (Upstream + Downstream jobs) or just Upstream, just Downstream, etc. Can I do it using DBT Core + Cloud Run? 3. This is quite off-topic to the video but wanna ask: DBT Cloud offers a VERY nice visualization of the full chain of dependencies. Is there any way to do it outside of DBT Cloud? Thanks again!

    • @practicalgcp2780 · 18 minutes ago

      No worries, happy it helped. DBT Cloud is indeed a very good platform to run DBT Core, and there is nothing wrong with using it; many businesses still use DBT Cloud today. However, that doesn't mean it's the only solution, or the best solution for every DBT use case. For example, DBT Cloud even today does not have a "deployment" concept; it relies on calling a version control SaaS for each run instead of using a local copy. Some caching has been implemented to prevent it from going down, but it's more of a workaround than a solution. This means that for mission-critical applications needing a strict SLA, it may be better not to use DBT Cloud. Many companies also can't use a SaaS like this for various reasons, such as data privacy concerns or the operational cost growing too high as the number of users grows. Which leaves us with DBT Core. You are right that this approach does not give you an easy way to rerun a DBT job partially; however, I think you can do this via the Airflow DAG parameters (the config JSON), which can be passed from the UI and then on to the Cloud Run job itself to handle ad-hoc administrative tasks. The thing I like about Cloud Run, compared to the K8s operator (which is another way to run DBT Core on Airflow), is that it's a Google SDK and serverless, so it's much easier to test and control. If this isn't an option you like, I recently came across github.com/astronomer/astronomer-cosmos I haven't tried it, but it looks quite promising. My main concern is what happens when versions change between DBT and Airflow: how accurate is the mapping, and does it have compatibility issues with Composer's default packages, which in the past caused me a lot of issues (hence the K8s executor solution being more popular)? One thing worth mentioning, in my view, is that Composer seems to be moving in a serverless, decentralised direction, and it makes no sense to centralise everything on one single cluster anymore; that means running Cosmos on a dedicated cluster might be a better option. But again, I haven't tried it yet.
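      To make the "pass the config JSON from the UI down to the Cloud Run job" idea concrete, here is a hedged sketch of a DAG (my own illustration, not from the video; whether overrides is templated depends on your google provider version, and the project, region, job name and default selector are hypothetical):

        from datetime import datetime

        from airflow import DAG
        from airflow.providers.google.cloud.operators.cloud_run import CloudRunExecuteJobOperator

        with DAG(
            dag_id="dbt_cloud_run_job",
            start_date=datetime(2024, 1, 1),
            schedule=None,  # triggered manually or by an upstream DAG
        ) as dag:
            run_dbt = CloudRunExecuteJobOperator(
                task_id="run_dbt",
                project_id="my-project",
                region="europe-west2",
                job_name="dbt-core-job",
                # Trigger the DAG with e.g. {"select": "my_model+"} to rerun part of the project;
                # "tag:daily" below is just a placeholder default.
                overrides={
                    "container_overrides": [
                        {"args": ["build", "--select", "{{ dag_run.conf.get('select', 'tag:daily') }}"]}
                    ]
                },
            )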

  • @s7006 · 11 days ago

    Very thorough analysis! One of the best YouTube channels out there on GCP product deep dives.

  • @AdrianIborra · 12 days ago

    Are you referring to DBT Core as not having VCS? From your point of view, does GCP have a similar service to DBT that supports a real-world, complex client system?

  • @sergioortega5130 · 15 days ago

    Excellent video, I spent hours figuring out where the run.invoke role needed to go; maybe I skipped it, but it is not mentioned anywhere in the docs 🥲 Gold, thanks man!

  • @AghaOwais · 25 days ago

    Hi, my DBT code is successfully deployed to Google Cloud Run. I am using DBT Core, not DBT Cloud. The only issue is that when I hit the URL, "Not Found" shows. I have identified the issue: when the code runs it keeps looking for dbt_cloud.yml, but how can that be used when I am only using DBT Core? Please advise. Thanks

  • @ItsMe-mh5ib · 25 days ago

    what happens if your source query combines multiple tables?

    • @practicalgcp2780 · 25 days ago

      The short answer is most of those won't work, at least during the public preview. You can use certain subqueries as long as they don't use keywords like EXISTS or NOT EXISTS, and JOIN won't work either. See the list of limitations here: cloud.google.com/bigquery/docs/continuous-queries-introduction#limitations This makes sense because it's a storage-layer feature, so it is very hard to implement things like listening to append logs on two separate tables and somehow putting them together. I would suggest focusing on reverse ETL use cases, which is what it's mostly useful for at the moment.
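      For reference, this is the single-table reverse-ETL shape it is designed for: a continuous query exporting newly appended rows to Pub/Sub (a hedged example; project, dataset, columns and topic are hypothetical, and it must run under a CONTINUOUS reservation assignment):

        EXPORT DATA
          OPTIONS (
            format = 'CLOUD_PUBSUB',
            uri = 'https://pubsub.googleapis.com/projects/my-project/topics/order-events'
          )
        AS (
          SELECT TO_JSON_STRING(STRUCT(order_id, customer_id, amount, updated_at)) AS message
          FROM `my-project.sales.orders`
        );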

    • @ItsMe-mh5ib · 25 days ago

      @@practicalgcp2780 thank you

  • @nickorlove-dev · 25 days ago

    LOVE LOVE LOVE the passion from our Google Developer Expert program, and Richard for going above and beyond to create this video! It's super exciting to see the enthusiasm being generated around BigQuery continuous queries! Quick feedback regarding the concerns/recommendations highlighted in the video: - All feedback is welcome and valid, so THANK YOU! Seriously! - The observed query concurrency limit of 1 query max for 50 slots and 3 queries max for 100 slots is an identified bug. We're in the process of fixing this, which will raise the limit and allow BigQuery to dynamically adjust the number of concurrent continuous queries being submitted based on the available CONTINUOUS reservation assignment resources. - Continuous queries are currently in public preview, which simply means we aren't done with feature development yet. There are some really exciting items on our roadmap, which I cannot comment on in such a public forum, but concerns over cost efficiency, monitoring, administration, etc. are at the VERY TOP of that list.

    • @practicalgcp2780 · 25 days ago

      Amazing ❤ thanks for the kind words and also the clarification on the concurrency bug, can't wait to see it get lifted so we can try it at scale!

  • @mohammedsafiahmed1639 · 26 days ago

    so this is like CDC for BQ tables?

    • @practicalgcp2780 · 26 days ago

      Yes pretty much, via SQL but more a reverse of CDC (reverse ETL in streaming mode if you prefer to call it that).

  • @mohammedsafiahmed1639 · 26 days ago

    thanks! Good to see you back

  • @SwapperTheFirst · 1 month ago

    Thanks. It is trivial to connect from local VS Code - it is just a small gcloud TCP tunnel command and that's it. Though you're right that the web browser experience is surprisingly good.

    • @practicalgcp2780 · 1 month ago

      Yup, I think a lot of the time when you have a remote workforce it is easier to keep everything together so you can install plugins as well. That tunnel approach can work, but often it's just additional risk to manage and IT issues to resolve when something doesn't work.

  • @user-wf5er3eo8v · 1 month ago

    This is good content. I have a question. I have a use case with data containing the columns "customers_reviews", "Country", "Year" and "sentiment". I am trying to create a chatbot that can answer queries like: "Negative comments related to xyz issue from USA from year 2023." For this I need to filter the data for USA and year 2023, with embeddings for the xyz issue to be searched from the database. Which database would be suitable: BigQuery, Cloud SQL or AlloyDB? All of them have vector search capabilities, but I need the most suitable and easiest to understand. Thanks

    • @practicalgcp2780 · 1 month ago

      One important thing to understand is the difference between databases suitable for highly concurrent traffic (B2C or consumer traffic) vs B2B (internal or external business users, i.e. a small number of users). BigQuery can be suitable for B2B when the peak number of concurrent users is low. For B2C traffic you never want to use BigQuery because it's not designed for that kind of workload. There are 3 databases on GCP that can be suitable for B2C traffic, and all of them support highly concurrent workloads: Cloud SQL, AlloyDB, and Vertex AI Feature Store vector search if you want serverless. You can use any of the 3, whichever you are more comfortable with; Vertex Feature Store can be quite convenient if your data is in BigQuery. A video I created recently might give you some good ideas on how to do this: ruclips.net/video/QIZwwCmEhzI/видео.html

  • @adeolamorren2678 · 1 month ago

    One separate question: if we have dependencies, since it's a serverless environment, we should add the dbt deps command in the Dockerfile args, or the runtime override args, right?

    • @practicalgcp2780 · 1 month ago

      No, I don't think that is the right way to do it. In serverless environments you can still package up dependencies, and this is something you typically do at build time, not run time, i.e. while you are packaging the container in your CI pipeline. DBT can generate a lock file which can be used to ensure packaging consistency on versions, so you don't end up with different versions each time you run the build. See docs.getdbt.com/reference/commands/deps The other reason you don't want to do it at run time is that it can be very slow to install dependencies on each run because it requires downloading them; plus you may not want internet access in a production environment (to be more secure in some setups), so doing this at build time makes a lot more sense.
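      A hedged Dockerfile sketch of resolving dbt packages at build time (base image, adapter and paths are assumptions, not from the video):

        FROM python:3.11-slim

        WORKDIR /app
        COPY requirements.txt .
        RUN pip install --no-cache-dir -r requirements.txt   # e.g. dbt-bigquery pinned in requirements.txt

        COPY . .
        # Resolve packages.yml against the committed package-lock.yml during the image build,
        # so the job needs no internet access (and no surprise versions) at run time.
        RUN dbt deps

        ENTRYPOINT ["dbt"]
        CMD ["build"]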

  • @adeolamorren2678 · 1 month ago

    with this approach is it possible to add environment variables that are isolated for each run? I basically want to pass environment variables for each run when I invoke google cloud run

    • @practicalgcp2780 · 1 month ago

      Environment variables are typically not designed for passing per-run values; they are usually set per environment and stick to each deployment, not each run. But it looks like both options are possible; I would stick to passing command-line arguments because they are more appropriate to override than environment variables. See this article on how to do it, it's explained well: chrlschn.medium.com/programmatically-invoke-cloud-run-jobs-with-runtime-overrides-4de96cbd158c
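      For reference, a hedged CLI equivalent of those per-execution overrides (job name, region and values are hypothetical, and the override flags on gcloud run jobs execute assume a reasonably recent gcloud release):

        gcloud run jobs execute dbt-core-job \
          --region=europe-west2 \
          --args="build,--select,my_model+" \
          --update-env-vars=DBT_TARGET=prod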

  • @aniket-kulkarni · 1 month ago

    After researching so much on this topic, finally a video that explains it clearly, especially the motivations and the problem we are going to solve with PSC.

    • @practicalgcp2780 · 1 month ago

      Comments like this is what keeps me going mate ❤ thanks for the feedback

  • @ritwikverma2463 · 1 month ago

    Thank you Richard for great GCP tutorials, please continue making these GCP video series.

  • @42svb58 · 1 month ago

    thank you for posting these videos!

  • @dollan1991 · 1 month ago

    I can't find it now, but I remember reading a GCP Issue Tracker entry that stated that the sync will always take 5min+ due to resources that need to be provisioned in the background

    • @practicalgcp2780 · 1 month ago

      I guess for a daily workload it's ok. These things don't typically need to be that up to date for most use cases. I do want to try that continuous mode though, which is more likely designed for real-time sync

  • @travisbot1414 · 2 months ago

    Awesome videos, you should make paid courses that cover this content $$$$$$$$

    • @practicalgcp2780 · 2 months ago

      Haha, thanks. It's more important to share knowledge for free so more companies can adopt Google Cloud and make it work better, and hopefully it will become the no. 1 cloud provider 😎 maybe one day in the future I will make a course.

    • @johnphillip9013 · 2 months ago

      @@practicalgcp2780 thank you so much

  • @AI0331 · 2 months ago

    This is really an amazing video, especially the troubleshooting part. Very clear 😊 Love it!!

  • @yinliu5471 · 2 months ago

    I like this video, it is the most informative and practical video on the topic of IAP. Thanks for sharing

  • @ritwikverma2463 · 2 months ago

    Hi Richard, can we create a Dataproc Serverless job in a different GCP project using a service account?

    • @practicalgcp2780 · 2 months ago

      I am not sure I understood you fully, but a service account can act in any project regardless of which project it was created in. The way it works is by granting the service account IAM permissions in the project where you want the job to be created; then it will work. But it may not be the best way to do it, as that one service account may end up with too much permission and scope. You can use separate service accounts, one for each project, if you want to reduce scope, or have a master one that impersonates other service accounts in those projects, but keep in mind it's key to reduce the scope of what each service account can do; otherwise, when there is a breach, it can mean massive damage across everything.
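      A hedged gcloud sketch of that cross-project setup (project names, the role and the job file are hypothetical; impersonation also assumes the caller holds Token Creator on the service account):

        # Grant a service account from project A the Dataproc permissions it needs in project B.
        gcloud projects add-iam-policy-binding project-b \
          --member="serviceAccount:spark-runner@project-a.iam.gserviceaccount.com" \
          --role="roles/dataproc.editor"

        # Submit the serverless batch into project B while acting as that service account.
        gcloud dataproc batches submit pyspark gs://my-bucket/jobs/etl.py \
          --project=project-b \
          --region=europe-west2 \
          --impersonate-service-account="spark-runner@project-a.iam.gserviceaccount.com"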

  • @HARDselection · 2 months ago

    As a member of a very small data team managing a complex orchestration workload, this is exactly what I was looking for. Thanks!

  • @nishantmiglani7021 · 2 months ago

    Thanks a lot, Richard He, for creating this insightful video on Analytics Hub.

  • @Iyanu-eb2eh · 2 months ago

    How do you know if the SQL table is actually connected?

    • @practicalgcp2780 · 2 months ago

      Sorry it’s been a while since I created this, if it works it is connected right? Am I missing something?

  • @agss · 2 months ago

    Thank you for the very insightful video! What is your take on using Dataform instead of DBT, when it comes to capabilities of both tools and ease to deploy and manage those solutions?

    • @practicalgcp2780 · 2 months ago

      Thank you, and spot-on question, I was wondering who was going to ask this first 🙌 I am actually making a Dataform video in the background but don't want to publish it unless I am 100% sure I am saying something useful. Based on my current findings, you could use either, and depending on what you need both can be a good fit. Dataform is a lot easier to get up and running, but it's quite new and I wouldn't recommend using it for anything too critical at this stage; it's also missing some key features like templating using Jinja (I don't really like the JavaScript templating system: it's built on TypeScript, which very few data teams use, so you would be locked into something with little support, which in my view is quite dangerous). But it is a lot easier to get up and running natively in GCP. DBT is still the go-to choice in my view, because it is built in Python and has a strong open-source community. For mission-critical data modelling work, I still think DBT is much better.

    • @agss · 2 months ago

      @@practicalgcp2780 you brought up exactly what I was worrying about. I highly appreciate your insight!

    • @strmanlt · 18 days ago

      Our team was debating migrating from DBT to Dataform. Dataform is actually a pretty decent tool, but the main issue for us was the 1,000-node limit per repo. So maybe if you have very simple models that do not require a lot of nodes it would work fine, but for us long-term scalability was the deciding factor

    • @practicalgcp2780 · 18 days ago

      @@strmanlt thanks for the input on this! Can I ask what the 1,000-node limit you are referring to is? Can you share the docs on this? Is it a limit on the number of steps / SQL files you can write?

    • @fbnz742 · 2 days ago

      Just wanted to share my thoughts here: I used Dataform for an entire project and it worked quite well. My data model was not so complex, and I learned how to integrate its logs with Airflow, being able to set up alerts to Slack pointing to the log file of the failed job, etc. However, I agree that Dataform templating is very strange. I personally don't have expertise with JavaScript so I suffered a lot with some things, but I was able to do pretty much all I wanted. I also struggled a lot when looking for things on the internet, and DBT is the exact opposite: you can find tons of content online. I would go with DBT.

  • @Rising_Ballers · 2 months ago

    Hi Richard, love your content. I've always wanted someone to do GCP training videos emphasising real-world use cases. I work with BigQuery and Composer, and I wanted to learn Dataproc and Dataflow, but everywhere I see the same type of training, not focusing much on real-world implementations. I want to learn how Dataproc and Dataflow jobs are deployed to different environments like dev, test and prod. Your videos are helping a lot; I hope you will do more videos on Dataflow and Dataproc, and how these jobs are created in real projects using CI/CD

    • @practicalgcp2780 · 2 months ago

      No worries glad you found this useful ❤

    • @Rising_Ballers · 2 months ago

      @@practicalgcp2780 I have one doubt: in an organisation, if we have many Dataproc jobs, how do we create them in different environments like dev, test and prod? Can you please do a video on that

  • @ayoubelmaaradi7409 · 3 months ago

    🤩🤩🤩🤩🤩

  • @ap2394 · 3 months ago

    Thanks for the detailed video. Can we have scheduling at task level? E.g. if I have 2 tasks in the downstream DAG and they depend on different datasets, can I control the schedule at task level?

    • @practicalgcp2780 · 21 days ago

      Just realised I never replied to this one, my apologies. I am not sure that is the right way to think about how this works. Regardless of which task or DAG it is, it's about listening to a change event from something triggered in the upstream dataset, then reacting to that event. As long as you design DAGs so that triggering a whole DAG on a change event is the right behaviour, it will work.

  • @viralsurani7944 · 3 months ago

    Getting below error while running pipeline with DirectRunner. Any idea? Transform node AppliedPTransform(Start Impulse FakePii/GenSequence/ProcessKeyedElements/GroupByKey/GroupByKey, _GroupByKeyOnly) was not replaced as expected.

  • @DExpertz · 3 months ago

    I appreciate this video Sir, 😍 (Subscribed and liked) will share too with my team.

    • @practicalgcp2780 · 3 months ago

      Thanks so much for your support ❤

    • @DExpertz · 3 months ago

      @@practicalgcp2780 Of course man, thank you for sharing this information in a simpler way

  • @10xApe · 4 months ago

    Can Cloud Run be used for a Power BI data refresh gateway?

    • @practicalgcp2780 · 3 months ago

      I haven't used Power BI, so I googled what the data refresh gateway is; according to learn.microsoft.com/en-us/power-bi/connect-data/refresh-scheduled-refresh it looks like it's some sort of service where you can control refreshes via a schedule? Unless there is some sort of API that allows you to trigger it from the Google Cloud ecosystem, I am not sure you can use it. I assume you are thinking of triggering some DBT job first and then refreshing the dashboard?

  • @adityab693 · 4 months ago

    In my org, teams have to get an exception, and the analyticshub.listing.subscribe role is not available. Also, data can be shared within the org VPC; how about sharing outside of the VPC?

  • @SamirSeth · 4 months ago

    Simply the best (and only) clear explanation of how this works. Thank you very much.

  • @QuynhNguyen-zy2rs · 4 months ago

    Hi, after you have created the data profile scan and data quality scan, is the Insights tab displayed? I don't see the Insights tab in your video. Please explain! Thanks!

  • @alifarah9 · 4 months ago

    Really appreciate these high quality videos! Seriously, your videos are better than the official GCP videos. What makes them invaluable is that you teach from first principles and talk about problems that will be faced in any cloud environment, not just GCP.

    • @practicalgcp2780 · 4 months ago

      Thank you so much 🙏 you are right, the principles are very much the same no matter which cloud provider it is. My focus is GCP because I believe as an ecosystem it's much more powerful, yet remains the easiest to implement and scale compared to other cloud providers.

  • @anantvardhan1212 · 4 months ago

    Amazing explanation! However, I have a doubt regarding the use of OAuth 2.0 creds in this whole setup. Does the OAuth client ID represent the backend service here, which is delegating authentication to IAP?

    • @practicalgcp2780 · 4 months ago

      Thank you, and I don't think this was explained well in the video. I did some more reading, and one thing I noticed is that the docs on how to create the backend service of the LB have changed: cloud.google.com/iap/docs/enabling-cloud-run#enabling. As you can see at 15:08 in the video, it used to require the client_id and client_secret to create the backend to enable IAP, but that doesn't seem to be there anymore. The latest docs have a note saying "The ability to authenticate users with a Google-managed OAuth client is available in Preview.". Technically, if it's in preview the docs should not have removed the old option, but if it is true then it means by default it will use the Google-managed OAuth client, and creating the credentials manually is no longer required. I've not tested this yet, but I think it's worth trying without a custom credential and just enabling IAP. It makes sense, as creating it manually and then specifying it is a lot of faff since you need to manage the secret rotation etc. yourself.

    • @practicalgcp2780 · 4 months ago

      And my understanding of the way this works is: when a user comes in, the user passes the auth header, the load balancer backend intercepts it and uses IAP to verify whether the user has permission, which is defined in IAM via the user group. Because the IAP service account has been granted invoker access to the Cloud Run service, the user is granted access after passing the IAP validation
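      A hedged gcloud sketch of those pieces (backend, service, group and project number are hypothetical; with the Google-managed OAuth client mentioned above, the explicit client id/secret may no longer be needed):

        # Turn IAP on for the load balancer backend that fronts Cloud Run.
        gcloud compute backend-services update my-cloud-run-backend \
          --global \
          --iap=enabled

        # Let the IAP service agent invoke the Cloud Run service.
        gcloud run services add-iam-policy-binding my-service \
          --region=europe-west2 \
          --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-iap.iam.gserviceaccount.com" \
          --role="roles/run.invoker"

        # Allow a user group through IAP.
        gcloud iap web add-iam-policy-binding \
          --resource-type=backend-services \
          --service=my-cloud-run-backend \
          --member="group:data-team@example.com" \
          --role="roles/iap.httpsResourceAccessor"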

  • @harshchoudhary6069 · 4 months ago

    How can we share an authorised view using Analytics Hub?

    • @practicalgcp2780 · 4 months ago

      It makes no difference using authorised views, as authorised view permissions are managed the same way as tables, different to normal views. However, using authorised views has some tradeoffs, a key one being losing metadata such as column descriptions which isn’t great for data consumers. But it does have the advantage if you don’t want to duplicate data models or increase latencies

  • @LongHD14 · 4 months ago

    May I ask one more question regarding this matter? I would like to implement permissions for a chat application concerning access to documents. For example, a person should only have access to certain tables or specific fields within those tables, and if they don't have permissions, they wouldn't be able to search. Do you have any suggestions or keywords that might help with this issue? Thank you very much for your assistance

    • @practicalgcp2780 · 4 months ago

      That is something you have to do through some sort of RBAC implementation (role-based access control). That isn't really about the search; it's more about mapping out the role of a user at login, like most applications do today. Then, depending on the role, you can add specific filters to the search queries, such as filtering on certain metadata, or restricting to a set of tables based on roles, etc.

    • @LongHD14 · 4 months ago

      Sure, I understand that. However, I'm looking for a service that can assist me with implementing RBAC.

    • @practicalgcp2780 · 4 months ago

      Ok, I see. I think it really depends on what you are using. For example, if you are building a backend with Python, you can use Django, which has an RBAC module, but generally any framework will have some sort of RBAC component you can use. If it's an internal app (for use within the company) then you can simplify things by just using IAP, but IAP isn't suitable for external consumer applications

    • @LongHD14 · 4 months ago

      @@practicalgcp2780 thank you for your answer!

  • @kavirajansakthivel · 5 months ago

    Hello Richard, it was a wonderful video, but somehow I couldn't set up the TCP proxy. How did you do it, through the reverse proxy method or the auth proxy method? You seem to be the only person who has done this successfully so far. Could you please create a tutorial video for it?

    • @practicalgcp2780 · 5 months ago

      Hi there, it's been a while since I last did it, and it's going to be quite difficult to understand what your problems are as it's quite a complex setup. If I remember correctly this is the documentation I followed: cloud.google.com/datastream/docs/private-connectivity#reverse-csql-proxy. Make sure you follow it step by step; in particular, don't forget to open the required firewall rules, as this is a common cause of issues.

  • @LongHD14 · 5 months ago

    Wow, this video is incredibly insightful and informative! 👏 I've learned so much and am grateful that you've shared this valuable content with us. Just a quick question: Could I apply these concepts to create a conversation app that details the findings in the search results? Looking forward to your guidance on this.

    • @practicalgcp2780 · 5 months ago

      Glad you found it useful! I don't see why not, but as I mentioned, for a conversational app (I assume you are talking about an app that is consumer-facing, i.e. real customers), the concept is exactly the same but you need to change the vector DB to something that supports highly concurrent workloads. So BigQuery is out of the picture; you can look at Vertex AI Vector Search and also AlloyDB, which I am hearing a lot about lately. I haven't tried either yet, but as far as I know they are both valid approaches for consumer apps that need highly concurrent workloads. The docs for AlloyDB are here: cloud.google.com/alloydb/docs/ai/work-with-embeddings

    • @LongHD14 · 5 months ago

      Thank you for your valuable insights and guidance!

    • @practicalgcp2780 · 5 months ago

      You are welcome ;)

  • @kamalmuradov6731 · 5 months ago

    I implemented a similar solution using Cloud Workflows (CW) + Cloud Functions (CF). The CW runs a loop and makes N requests to the CF in parallel each iteration, where N is equal to the CF's max instances. I'll look into querying Stackdriver each loop to dynamically determine concurrency. I chose CW over Cloud Scheduler (CS) for a few reasons. First, CS is limited to at most 1 run per minute, which wasn't fast enough to keep my workers busy (they process a batch in under 30 seconds). Second, CS can't make N requests in parallel, so it would require something in between to replicate what the CW is doing. Third, CW has a configurable retry policy, which is handy for dealing with occasional CF network issues. One caveat with CW is that a single execution is limited to 100k steps. To work around this issue, I limit each CW execution to 10k loops, at the end of which it triggers a new workflow execution and exits. I set up an alerting policy to ensure there is always exactly 1 execution of this workflow running and haven't had any issues.

    • @practicalgcp2780 · 5 months ago

      Hmm, interesting approach. Although I am not sure we are comparing apples to apples here. The solution demonstrated in this video is an always-on approach. In other words, the pull subscriber is always listening on the Pub/Sub subscription; it doesn't die after processing all remaining messages, but simply waits. So if you change the interval of Cloud Scheduler to 10 minutes, and let the pull subscriber run for 9 minutes 50 seconds for example, it will not get killed until it reaches that timeout (which is in the code example I gave). I am not sure if I misunderstood you here, but the solution here is no different to what you would normally do with a GKE deployment; it's just an alternative without needing any infrastructure.
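      For anyone reading along, a hedged sketch of that always-on pull-until-timeout pattern with the Pub/Sub client library (project, subscription name and the 9m50s window are hypothetical):

        from concurrent import futures

        from google.cloud import pubsub_v1

        PULL_WINDOW_SECONDS = 9 * 60 + 50  # stay alive just under the Cloud Scheduler interval

        def handle(message: pubsub_v1.subscriber.message.Message) -> None:
            # ... process the payload and write downstream ...
            message.ack()

        subscriber = pubsub_v1.SubscriberClient()
        subscription = subscriber.subscription_path("my-project", "orders-sub")
        streaming_pull = subscriber.subscribe(subscription, callback=handle)

        with subscriber:
            try:
                # Block here and keep consuming until the window elapses, then exit cleanly
                # so the next scheduled trigger starts a fresh Cloud Run invocation.
                streaming_pull.result(timeout=PULL_WINDOW_SECONDS)
            except futures.TimeoutError:
                streaming_pull.cancel()
                streaming_pull.result()  # wait for shutdown to complete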

    • @kamalmuradov6731 · 5 months ago

      That sounds correct! In my case the CF does a “synchronous pull” of a few thousand messages, processes them, and acks them all in bulk. So it’s not an always-on streaming setup like what you demoed here. It handles 1 batch per request, shuts down, and then is invoked again in the next loop by the CW. For this particular use case, batching is advantageous so I went with synchronous pull. But it would be straightforward to switch the CF to a streaming pull if batching was not necessary.

  • @stevenchang4784 · 5 months ago

    VS Code here is a browser-based environment, but for a lot of users Cloud Shell and SSH connections are restricted. Do you think Cloud Workstations could bypass those restrictions?

    • @practicalgcp2780 · 5 months ago

      Hi there, I haven't used this at scale yet, but my understanding is that one of the most important reasons for having Cloud Workstations is to get around these restrictions. The common reason Cloud Shell does not work in many orgs is the inability to support private IP and VPC SC, but this isn't the case for workstations, as these are deployed within your network. Check it out, it's documented here: cloud.google.com/workstations

    • @stevenchang4784 · 5 months ago

      @@practicalgcp2780 Hi, I tested it all day. Thank you for your reply. It really solves the Cloud Shell public IP issue.

  • @jean9174 · 5 months ago

    😎 "Promosm"

  • @digimonsta · 5 months ago

    Really interesting and informative. I'm currently looking at migrating away from a GKE workload purely because of the complexity, so this may prove useful. I'd be interested to know if you feel Cloud Run Jobs would support my use case. Essentially, based on a Pub/Sub message, I need to pull down a bunch of files from a GCS bucket, zip them into a single archive and then push the resulting archive back into GCS. This zip file is then presented to the user for download. There could be many thousands of files to zip and the resulting archive could be several terabytes in size. I was planning on hooking up Filestore or GCS FUSE to Cloud Run to facilitate this. The original implementation was in Cloud Run (prior to Jobs), but at the time, no one knew how many files users would need to download or how big the resulting zip files would be. We had to move over to GKE, as we hit the maximum time limit allowed for Cloud Run before it was automatically terminated.

    • @practicalgcp2780 · 5 months ago

      Thanks for the kind comment. It's quite an interesting problem, because the size of the archive can potentially be huge, so it can take a long time. You are right: a Cloud Run service, I think even today, can only handle up to a 1-hour timeout, while a Cloud Run job can handle 24 hours now. So if your archive process won't take longer than a day, I don't see why you can't use this approach. If you need more time you can look at Cloud Batch, which can run longer without needing to create a cluster, but it's more complex to track the state of the operation. I have another video describing use cases for Batch. Having said that, it feels a bit wrong to have archives of huge size like that; have you considered generating the Pub/Sub messages from upstream systems in smaller chunks, or using the Cloud Run service to break things down and only zip so many files in a single execution, tracking the offset somewhere (i.e. in a separate Pub/Sub topic) to trigger more short-lived zip operations? The thing is, if there's a network glitch, which happens every now and then, you could waste a huge amount of compute. Personally I would always prefer to make the logic slightly more complex in the code than maintain a GKE cluster myself, just to keep the infrastructure as simple as possible, but that is just my opinion.

  • @user-dl5mm9fu9g · 5 months ago

    The introduction is very detailed and very good. Good job, buddy...

    • @practicalgcp2780 · 5 months ago

      Thank you! It's great you found it useful

  • @eyehear10 · 5 months ago

    This seems like it adds more complexity compared with the push model

    • @practicalgcp2780 · 5 months ago

      Yes it does, but not everything upstream supports the push model, plus not every downstream can handle the load via the push model. I explained some of the pros and cons, mainly related to controlling or improving throughput (i.e. limiting how much traffic you want to consume if there is too much, or using batching). A really important thing to consider is the downstream system and how many connections you establish, or how many concurrent requests you make if, for example, the downstream system is an HTTPS endpoint. Opening too many requests can easily overwhelm the system on the other side, whereas batching requests, or opening a single connection and reusing it, makes a huge difference. If it's possible to use push without the constraints above, it's almost always better to use push. Hope that makes sense

  • @SwapperTheFirst · 5 months ago

    Any examples of such tools for cataloguing, certification and lineage? Especially OSS? I had some experience with Qlik Catalog, but I'm not sure if this is a good choice for GCP and how well it is integrated with BQ. Beyond the usual suspects (Collibra, Immuta, ...)

    • @practicalgcp2780 · 5 months ago

      There are a few GCP partners that have very good integration with GCP, saving engineers a lot of time on metadata integration. Collibra is one of them, as you already mentioned; you can also look at Atlan, a newer player in the field which has some powerful features too. Those are the two I am aware of that in my view have pretty good integration and features, but please do your own research; there are pros and cons and these are not recommendations I am making here. By OSS do you mean support systems like JSM?

    • @SwapperTheFirst · 5 months ago

      @@practicalgcp2780 nope, I mean open source software, like Apache Airflow for workflow management. From which you can also make managed solutions, like Astronomer or Cloud Composer. I think something should exist in this space too?

  • @SwapperTheFirst · 5 months ago

    I like this format of battle stories/coaching.

    • @practicalgcp2780 · 5 months ago

      Thanks ☺️ thought I might try a different way to present; it feels like more people can relate to this

  • @WiktorJurek · 5 months ago

    This is bang on. It would be awesome to see how this works in practice - as in, how all of this looks in the console, how to set it up, and practically how you can oversee/manage this kind of setup.

    • @practicalgcp2780 · 5 months ago

      There's quite a lot of effort involved, but the foundation isn't that difficult to set up. It's not as though there is a single UI where everything can be done, though. I think the entry point for data management and discovery for a large group of users can be the catalog tool, and a platform team can own the tooling for things like quality scans and Analytics Hub while making them self-service. There are some things, especially the data quality check rules, that I would prefer to keep in version control so it's much easier to control the changes and the quality of the checks, whereas for other things like Analytics Hub the UI should be sufficient as long as there is a way to recover if something goes wrong