Idempotency for BigQuery load jobs using Google.Cloud.BigQuery.V2

You can create a CSV load job that loads data from a CSV file in Google Cloud Storage by calling the CreateLoadJob method on the BigQueryClient in Google.Cloud.BigQuery.V2.

How can you guarantee idempotency with this API? For example, if the network drops before a response arrives and you kick off a retry, how do you make sure the same data is not loaded into BigQuery more than once?

Example API usage

    private void LoadCsv(string sourceUri, string tableId, string timePartitionField)
    {
        var tableReference = new TableReference()
        {
            DatasetId = _dataSetId,
            ProjectId = _projectId,
            TableId = tableId
        };

        var options = new CreateLoadJobOptions
        {
            WriteDisposition = WriteDisposition.WriteAppend,
            CreateDisposition = CreateDisposition.CreateNever,
            SkipLeadingRows = 1,
            SourceFormat = FileFormat.Csv,
            TimePartitioning = new TimePartitioning
            {
                Type = _partitionByDayType,
                Field = timePartitionField
            }
        };

        BigQueryJob loadJob = _bigQueryClient.CreateLoadJob(sourceUri: sourceUri,
                                                            destination: tableReference,
                                                            schema: null,
                                                            options: options);

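        // PollUntilCompleted returns a BigQueryJob reflecting the final state,
        // so check the status on the returned object rather than the original one.
        loadJob = loadJob.PollUntilCompleted();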
        if (loadJob.Status.Errors == null || !loadJob.Status.Errors.Any())
        {
            //Log success
            return;
        }
        //Log error
    }
Jon Skeet

There are two places you could end up losing the response:

  • When creating the job to start with
  • When polling for completion

The first failure is relatively tricky to recover from without a job ID: you would have to list all the jobs in the project and try to find one that looks like the one you would otherwise create.

However, the C# client library generates a job ID so that it can retry, or you can specify your own job ID via CreateLoadJobOptions.
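If you choose the job ID yourself, one option is to derive it deterministically from the inputs of the load, so that a retry of the same load reuses the same ID. The following is a minimal sketch under that assumption; the `LoadJobIds` class and its `ForLoad` method are hypothetical names, not part of the client library, and it only uses characters that BigQuery allows in job IDs (letters, digits, underscores and dashes).

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class LoadJobIds
{
    // Hypothetical helper: the same (sourceUri, tableId) pair always yields
    // the same job ID, so a retried load reuses the ID of the first attempt.
    public static string ForLoad(string sourceUri, string tableId)
    {
        using (var sha = SHA256.Create())
        {
            // Separate the parts with a newline so ("ab", "c") and ("a", "bc") differ.
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(sourceUri + "\n" + tableId));

            // Lowercase hex keeps the ID within the allowed character set.
            string hex = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
            return "load_" + hex;
        }
    }
}
```

You would then set `JobId = LoadJobIds.ForLoad(sourceUri, tableId)` on the CreateLoadJobOptions before calling CreateLoadJob. If the first attempt did reach the server, a retry with the same ID should be rejected as a duplicate instead of loading the data again, and you can fetch the existing job by that ID. Note that job IDs must be unique within a project, so if you ever need to load the same file into the same table again on purpose, you would have to include something extra (a batch number, say) in the hashed input.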

The second failure is much simpler: keep the returned BigQueryJob so you can retry the polling if it fails. (You could also store the job ID so that you can recover even if your process dies while waiting for the job to complete.)
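To survive a process crash, the job ID has to be written somewhere durable before you start waiting. Here is a rough sketch that uses a local file for this; the `PendingJobStore` class is a hypothetical name for illustration, and a real system might use a database row or a queue message instead.

```csharp
using System;
using System.IO;

// Hypothetical helper that remembers the ID of the load job currently in flight.
public sealed class PendingJobStore
{
    private readonly string _path;

    public PendingJobStore(string path)
    {
        _path = path;
    }

    // Record the job ID before (or immediately after) creating the job.
    public void Save(string jobId)
    {
        File.WriteAllText(_path, jobId);
    }

    // Returns the stored job ID, or null if no job is pending.
    public string TryLoad()
    {
        return File.Exists(_path) ? File.ReadAllText(_path).Trim() : null;
    }

    // Forget the job once it has completed and its result has been handled.
    public void Clear()
    {
        File.Delete(_path);
    }
}
```

On startup, if `TryLoad()` returns a job ID, you can fetch that job with the client's `GetJob(jobId)` and call `PollUntilCompleted()` on the result instead of creating a new load job; once the outcome has been logged, call `Clear()`.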


See more on this question at Stack Overflow