Loading CSV Data From Local Machine To BigQuery Via API
Here is the solution diagram for loading CSV data from the local machine to BigQuery:
Here, we are getting the Covid19 CSV dataset from GitHub and downloading it to the local machine. The data is then loaded into the BigQuery dataset that we have already created, via the BigQuery Python API.
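The download step itself is not shown here; as a minimal sketch (assuming the Our World in Data vaccination locations.csv file on GitHub and the local Downloads path used later in this walkthrough, both of which you may need to adjust), the file can be fetched with the requests library:
import requests

#assumed URL of the Covid19 vaccination locations.csv on GitHub; adjust if needed
CSV_URL = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/locations.csv"
#local path that the rest of this walkthrough loads from
LOCAL_PATH = r"C:\Users\Sherry\Downloads\locations.csv"

response = requests.get(CSV_URL, timeout=60)
response.raise_for_status()  #raise an error if the download failed

with open(LOCAL_PATH, "wb") as f:
    f.write(response.content)
print("Saved CSV to {}".format(LOCAL_PATH))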
Now we will head to Visual Studio Code and start writing our Python code for loading CSV data from the local machine to BigQuery via the API. We will create a new Python file called bqLoad.py.
Code Steps:
Our code has the following steps:
- Import Google Cloud libraries
- Specify the path to the GCP credentials JSON
- Create a BigQuery client object
- Specify the BigQuery table ID
- Give the path to the CSV file on the local machine
- Specify the load job configuration
- Make the API request
Code:
#import libraries
from google.cloud import bigquery

#create the BigQuery client object from the GCP service account credentials JSON
client = bigquery.Client.from_service_account_json(
    r'C:\Users\Sherry\Downloads\pelagic-earth-353304-5e9ab57c7481.json'
)

#BigQuery table ID (project.dataset.table)
table_id = "pelagic-earth-353304-5e9ab57c7481.covid_dataset.covid19_vaccination"

#path to the CSV file on the local machine
file_path = r"C:\Users\Sherry\Downloads\locations.csv"

#function to load a local CSV file into a BigQuery table
def load_table_file(file_path: str, table_id: str) -> "bigquery.Table":
    #load job configuration: CSV source, skip the header row, auto-detect the schema
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    #open the local file and start the load job
    with open(file_path, "rb") as source_file:
        job = client.load_table_from_file(source_file, table_id, job_config=job_config)
    job.result()  # Waits for the job to complete.
    table = client.get_table(table_id)  # Make an API request.
    #print number of loaded rows and columns
    print(
        "Loaded {} rows and {} columns to {}".format(
            table.num_rows, len(table.schema), table_id
        )
    )
    return table
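With the table ID and local file path set at the top of the script, one way to run the load is simply to call the function:
#run the load using the table ID and local file path defined above
table = load_table_file(file_path, table_id)
The function returns the loaded bigquery.Table object, so the caller can inspect its schema or row count after the job finishes.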
Load CSV data from Cloud Storage to BigQuery via API
Here is the solution diagram for loading CSV data from Google Cloud Storage to BigQuery:
Here, we are getting the Covid19 CSV dataset from GitHub and uploading it to Google Cloud Storage. The data is then loaded into the BigQuery dataset that we have already created, via the BigQuery Python API. Following is the step-by-step implementation:
- Creating a Google Cloud Storage Bucket and Uploading Data
First, we need to create a Google Cloud Storage bucket on GCP for this task:
- Go to Cloud Storage from the search bar or the Navigation menu. Click on Buckets from the left navigation menu.
- Click on +Create to create the bucket.
- This will open a window for creating a bucket. In the Name Your Bucket section, give a unique name to your bucket; I'll choose demo-bucket-321.
- Next, we have the Choose where to store your data section. Here, we need to specify the Location Type and Location. There are three location types: Multi-region, Dual-region and Region. I’ll choose Region for this demo and the region will be us-east1.
- After that, we have the option to choose a storage class for our data. We can choose from Standard, Nearline, Coldline and Archive. We will go with Standard.
- Next, we need to choose the Access control, which can either be Uniform or Fine-grained. We are selecting Uniform. We also have the checkbox for Enforce public access prevention on this bucket; we will keep it checked. Leave the data protection options as-is, and click on Create.
- After creating the bucket, we will be redirected inside it.
- For this lab, we need to upload our Covid19 vaccination CSV dataset into our storage bucket. For that, we will click on Upload Files inside our bucket and choose the CSV file from the local machine. After uploading, we can see our file. (A programmatic way to perform this step is sketched right after this list.)
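If you prefer to script the bucket creation and upload instead of using the console, here is a minimal sketch using the google-cloud-storage Python client; it assumes the same service-account JSON, bucket name, region, and local CSV path used elsewhere in this walkthrough:
from google.cloud import storage

#create the Cloud Storage client from the GCP service account credentials JSON
storage_client = storage.Client.from_service_account_json(
    r'C:\Users\Sherry\Downloads\pelagic-earth-353304-5e9ab57c7481.json'
)

#create the bucket in us-east1 with the Standard storage class
bucket = storage_client.bucket("demo-bucket-321")
bucket.storage_class = "STANDARD"
new_bucket = storage_client.create_bucket(bucket, location="us-east1")

#upload the CSV file from the local machine
blob = new_bucket.blob("locations.csv")
blob.upload_from_filename(r"C:\Users\Sherry\Downloads\locations.csv")
print("Uploaded locations.csv to gs://{}".format(new_bucket.name))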
Load Data from Cloud Storage to BigQuery via Python API
Now we will head to Visual Studio Code and start writing our Python code for loading CSV data from Google Cloud Storage to BigQuery via the API. We will create a new Python file called bqStorageLoad.py.
Code Steps:
Our code has the following steps:
- Import Google Cloud libraries
- Specify the path to the GCP credentials JSON
- Create a BigQuery client object
- Specify the BigQuery table ID
- Specify the load job configuration
- Provide the storage bucket path
- Make the API request
#import libraries
from google.cloud import bigquery

#create the BigQuery client object from the GCP service account credentials JSON
client = bigquery.Client.from_service_account_json(
    r'C:\Users\Sherry\Downloads\pelagic-earth-353304-5e9ab57c7481.json'
)

# Set the dataset
dataset_id = 'covid_dataset'
#set the table
table_id = 'covid19_vaccination'
#fully qualified table reference (project.dataset.table)
table_ref = "{}.{}.{}".format(client.project, dataset_id, table_id)

# Configure the load job
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV  #specify source format as CSV
job_config.skip_leading_rows = 1  #skip the header row
job_config.autodetect = True  #auto detect schema

# Start the load job
uri = "gs://demo-bucket-3211/locations.csv"  #path to the CSV file in the storage bucket

#load job
load_job = client.load_table_from_uri(
    uri, table_ref, job_config=job_config
)  # API request
load_job.result()  # Waits for the table load to complete.

#print number of loaded rows
print("Loaded {} rows into {}:{}.".format(
    load_job.output_rows, dataset_id, table_id))
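As an optional sanity check (a small sketch reusing the client and table_ref created above), you can fetch the destination table after the load completes and print its row and column counts:
#verify the load by fetching the destination table's metadata
destination = client.get_table(table_ref)  # Make an API request.
print("Table {} now has {} rows and {} columns.".format(
    table_ref, destination.num_rows, len(destination.schema)))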