Change Data Capture — Real-Time Data Transfer to BigQuery with GCP Datastream
Hello! I haven't posted here in a long time, but from now on I'll make an effort to write little by little, and I wanted to get a quick start on the new year. Going forward, I will publish my posts in both English and Turkish. Anyway, let's move on to today's topic: how to perform Change Data Capture (CDC) with Datastream on GCP and send the data to BigQuery as the destination.
First of all, what is Datastream?
GCP Datastream is a serverless, user-friendly service designed to stream data out of relational databases. It does not support every relational database yet, but it covers the major ones: MySQL, PostgreSQL, AlloyDB, and Oracle. The service reads changes from these databases (known as Change Data Capture, or CDC) and transfers the data to destinations such as BigQuery, Cloud SQL, Cloud Storage, and Cloud Spanner.
Advantages of Datastream
Datastream makes it straightforward to set up low-latency ELT (Extract, Load, Transform) pipelines for near real-time insights in BigQuery. As a serverless, fully managed service, it handles its own operations and scales up and down automatically with minimal downtime.
Datastream integrates directly with key GCP services such as Cloud Storage, BigQuery, and Cloud SQL.
It facilitates the synchronization and consolidation of data flows between heterogeneous databases and applications.
Now, let’s dive into the application points and some use cases for Datastream.
Imagine you have an orders table in PostgreSQL or Oracle that receives constant updates and inserts. Real-time processing of this data is crucial because an important department needs timely reports. This is exactly the scenario CDC was designed for, and Datastream makes it easy: every change to the orders table is read from the database logs by Datastream and then sent to BigQuery.
Since this write-up focuses entirely on the Datastream integration, I won't go into the required VPC settings here. However, note that for databases inside a company's internal network or on another cloud platform, a tunnel to GCP must be established. Details on this can be found online.
Let’s move on to the explanation with visuals.
To enable Datastream to read data from the source database, a few configurations are required on the source: a publication, a replication slot, and the appropriate permissions on the tables.
- Publication: It is recommended to create a publication only for the tables you want to replicate. This lets Datastream read only the relevant data, reducing the load on both the database and Datastream. However, you can also define a publication for all tables in a schema, or for the whole database. Here are some sample statements:
If you want to create a publication for specific tables:
CREATE PUBLICATION PUBLICATION_NAME
FOR TABLE SCHEMA1.TABLE1, SCHEMA2.TABLE2;
If you want to include all tables under one or more schemas (available in PostgreSQL 15 and later):
CREATE PUBLICATION PUBLICATION_NAME
FOR TABLES IN SCHEMA SCHEMA1, SCHEMA2;
If you want to do this for all tables in the database:
CREATE PUBLICATION PUBLICATION_NAME FOR ALL TABLES;
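After creating a publication, a quick sanity check is to list which tables it actually covers by querying the pg_publication_tables catalog view (the publication name below is just a placeholder; unquoted names are stored in lowercase):
SELECT * FROM pg_publication_tables WHERE pubname = 'publication_name';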
After creating the publication, the next step is to create a logical replication slot, which is what Datastream actually reads the change stream from. To create the replication slot:
SELECT PG_CREATE_LOGICAL_REPLICATION_SLOT('REPLICATION_SLOT_NAME', 'pgoutput');
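To confirm the slot was created (and, later on, to see whether Datastream is actively consuming it), you can run a plain catalog query like the one below; there is nothing Datastream-specific about it:
SELECT slot_name, plugin, slot_type, active FROM pg_replication_slots;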
After completing this step, let's move on to the user and permissions that Datastream needs. Below you can see how to create the user and grant the necessary privileges.
CREATE USER USER_NAME WITH REPLICATION LOGIN PASSWORD 'USER_PASSWORD';
ALTER USER USER_NAME CREATEDB;
GRANT SELECT ON ALL TABLES IN SCHEMA SCHEMA_NAME TO USER_NAME;
GRANT USAGE ON SCHEMA SCHEMA_NAME TO USER_NAME;
ALTER DEFAULT PRIVILEGES IN SCHEMA SCHEMA_NAME GRANT SELECT ON TABLES TO USER_NAME;
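If you'd like to double-check the grants before moving on, a query against information_schema such as the following should work (here 'user_name' is the same placeholder as above, stored in lowercase when the name was not quoted):
SELECT table_schema, table_name, privilege_type
FROM information_schema.role_table_grants
WHERE grantee = 'user_name';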
After completing all these configurations, we can move on to the Datastream interface.
When you first open Datastream, it greets you with the screen shown below. Of course, if you haven't enabled the Datastream API yet, you'll be asked to do that first. Once the API is enabled, you should see a screen like this one.
First, we need to create a private connectivity configuration, then a connection profile, and finally we test the connection before starting the stream.
We need private connectivity because we are accessing a database outside of GCP, which requires a tunnel between the database and GCP. The private connectivity configuration is where we set Datastream up to use that tunnel. On the GCP side we created a VPC network with two subnets; those subnets can also be used by other services, since Datastream manages its own peering subnet and doesn't ask you to create one. Here, we used a demo VPC network named "poc". You also need to provide an available IP range for the peering.
After successfully creating the connectivity configuration, the next step is the Connection Profile section.
You should select the database for which the profile will be created.
Here, we select PostgreSQL.
From this point on, you enter information about the database. Give your profile a name and an ID, then choose the region. After that, specify the database you will connect to.
After entering this information, select the private connectivity configuration you just created, as shown below. Choose "VPC Peering" as the connectivity method.
After completing this setup, you can test access to the database. The test confirms that the tunnel between GCP and PostgreSQL has been established correctly. If the test passes, the connection profile setup is complete.
Now, let's move on to creating the stream. When you start the stream setup, you are presented with a 6-step path. In the image below, you can see the details of the first step, with all 6 steps listed on the left side. The crucial points in the first step are selecting the region, source, and destination correctly.
In the 2nd step, you need to select the connection profile that you created.
In the 3rd step, you enter the source information. You need to input the replication slot and publication names created at the beginning of this post, and then select which tables and columns will be included.
In the 4th step, you set up the destination, which is BigQuery. Datastream creates the tables automatically, so no additional configuration is required.
In the 5th step, the Configuration section, you can decide whether to create a separate BigQuery dataset for each source schema or to place all tables in a single dataset. The choice is yours.
In the 6th and final step, Datastream shows a summary of all the configuration settings for review. Go through the summary, and if every setting has a checkmark or appears in green, your Datastream service is ready for Change Data Capture (CDC).
After clicking Create, you can start seeing your tables in BigQuery approximately 5–10 minutes later.
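Once the tables appear, you can query them in BigQuery like any other table. As a rough freshness check, you can also look at the metadata column Datastream adds to each replicated table; the dataset and table names below are placeholders, and the exact metadata fields may differ depending on your setup:
SELECT MAX(datastream_metadata.source_timestamp) AS last_source_change
FROM `my_project.my_dataset.orders`;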
A few specific warnings about Datastream. Under heavy load on the source database, data may arrive with a delay. Also, when the database is restarted or upgraded, the publication and replication slot you created may be dropped; in that case they need to be recreated before Datastream can receive data again. For these situations, consider setting up a data freshness alert on GCP that triggers when the stream stops receiving data due to errors. You can manage the alerting through Log Explorer. We can discuss this in more detail in a future post.
For now, I've explained how to do CDC with Datastream and how to connect it to BigQuery. I hope you found it helpful.
STAY WITH LOVE…