How to Load Data from Pandas DataFrames to BigQuery
Google BigQuery is a fully managed, cloud-native data warehouse for storing and analyzing large amounts of data. In this article, we’ll discuss how to load data from a Pandas DataFrame into BigQuery using the google-cloud-bigquery client library, which we install alongside Pandas GBQ.
Prerequisites
Before we begin, you’ll need:
- A Google Cloud Platform (GCP) account
- A BigQuery project and dataset
- Python installed on your local machine
- The Pandas, Pandas GBQ, and google-cloud-bigquery libraries installed
Installation
To install the required libraries, run the following command:
pip install pandas pandas-gbq google-cloud-bigquery
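After installing, it can help to confirm that the packages are importable from the interpreter you plan to use. The following is a minimal sketch that relies only on the standard library. The helper name missing_modules and the REQUIRED list are illustrative, and pyarrow is included on the assumption that the DataFrame load path needs it for Parquet serialization.

```python
from importlib.util import find_spec

# Import names (not pip package names) for the libraries used in this article.
# pyarrow is listed as an assumption: DataFrame loads typically serialize to Parquet.
REQUIRED = ("pandas", "pandas_gbq", "google.cloud.bigquery", "pyarrow")


def missing_modules(names):
    """Return the subset of `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = find_spec(name) is not None
        except ImportError:
            # Raised for dotted names when a parent package cannot be imported.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(REQUIRED)
    print("All set." if not absent else f"Missing: {', '.join(absent)}")
```

If anything is reported as missing, re-running the pip command in the same environment is usually the first thing to try.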
Code Example
Let’s say you have a CSV file named “data.csv” that you want to load into a Pandas DataFrame and then into a BigQuery table. Here’s an example of how you can do this:
import pandas as pd
from google.cloud import bigquery
# Create a BigQuery client; it authenticates with your default Google Cloud credentials
client = bigquery.Client()
# Load the CSV file into a Pandas DataFrame
df = pd.read_csv('data.csv')
# Define the BigQuery table schema
schema = [
    bigquery.SchemaField('column1', 'STRING'),
    bigquery.SchemaField('column2', 'INTEGER'),
    bigquery.SchemaField('column3', 'FLOAT'),
]
# Create the BigQuery table; bigquery.Table needs a fully qualified ID (project.dataset.table)
table_id = f'{client.project}.my_dataset.my_table'
table = bigquery.Table(table_id, schema=schema)
table = client.create_table(table, exists_ok=True)  # Don't fail if the table already exists
# Insert the data from the DataFrame into the BigQuery table.
# No CSV options are set: the DataFrame is already parsed, so skipping a header row would drop data.
job_config = bigquery.LoadJobConfig(schema=schema)
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()  # Waits for the job to complete.
print(f'Loaded {job.output_rows} rows into {table_id}.')
In this example, we first authenticate and create a BigQuery client using the google.cloud.bigquery library. We then load the CSV file into a Pandas DataFrame using the pd.read_csv() function. Next, we define the BigQuery table schema and create the table using the client.create_table() method. Finally, we insert the data from the DataFrame into the table using the client.load_table_from_dataframe() method.
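Because Pandas GBQ is installed as well, the same load can be sketched with its to_gbq function, which takes the DataFrame directly and creates the table if needed. This is a minimal sketch under stated assumptions: the wrapper name load_with_pandas_gbq and the project ID 'my-project' are hypothetical, the column names mirror the schema above, and if_exists='append' is one possible choice (the library's default is 'fail').

```python
# Same three columns as the schema in the article, in the list-of-dicts
# form that pandas-gbq accepts for its table_schema argument.
TABLE_SCHEMA = [
    {"name": "column1", "type": "STRING"},
    {"name": "column2", "type": "INTEGER"},
    {"name": "column3", "type": "FLOAT"},
]


def load_with_pandas_gbq(df, destination_table, project_id, if_exists="append"):
    """Upload `df` to `destination_table` ('dataset.table') via pandas-gbq.

    Hypothetical wrapper for illustration; it only forwards its arguments.
    """
    # Imported here so the module can be loaded even where pandas-gbq is absent.
    import pandas_gbq

    pandas_gbq.to_gbq(
        df,
        destination_table,
        project_id=project_id,
        if_exists=if_exists,  # 'fail', 'replace', or 'append'
        table_schema=TABLE_SCHEMA,
    )
```

With a DataFrame df in hand, calling load_with_pandas_gbq(df, 'my_dataset.my_table', project_id='my-project') would append its rows to the table. The trade-off is brevity against control: to_gbq hides the load job, while the client library exposes the job object and its configuration.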
Conclusion
Loading data from Pandas to BigQuery is a simple process using the google-cloud-bigquery client library, with Pandas GBQ available as an alternative. With this method, you can store and analyze large amounts of data in BigQuery with the power and flexibility of Pandas.