Initial Implementation

sdrp713 · sdrp713 · commit 80da99b63d03 · 2026-06-09T15:04:52.000-07:00
diff --git a/docs/additional-functionality/advanced_configs.md b/docs/additional-functionality/advanced_configs.md
@@ -50,6 +50,7 @@ Name | Description | Default Value | Applicable at
 <a name="memory.gpu.reserve"></a>spark.rapids.memory.gpu.reserve|The amount of GPU memory that should remain unallocated by RMM and left for system use such as memory needed for kernels and kernel launches.|671088640|Startup
 <a name="memory.gpu.state.debug"></a>spark.rapids.memory.gpu.state.debug|To better recover from out of memory errors, RMM will track several states for the threads that interact with the GPU. This provides a log of those state transitions to aid in debugging it. STDOUT or STDERR will have the logging go there empty string will disable logging and anything else will be treated as a file to write the logs to.||Startup
 <a name="memory.gpu.unspill.enabled"></a>spark.rapids.memory.gpu.unspill.enabled|When a spilled GPU buffer is needed again, should it be unspilled, or only copied back into GPU memory temporarily. Unspilling may be useful for GPU buffers that are needed frequently, for example, broadcast variables; however, it may also increase GPU memory usage|false|Startup
+<a name="perfio.gcs.enabled"></a>spark.rapids.perfio.gcs.enabled|Controls the Google Cloud Storage reader for improved performance in certain queries. When true, enables it and throws at startup if google-cloud-storage classes are not on the classpath. When false, disables it unconditionally. When unset (default), enables it opportunistically if google-cloud-storage classes are found, otherwise falls back to the configured GCS connector with a warning. The presence of com.google.cloud:google-cloud-storage on the executor classpath is required.|None|Startup
 <a name="perfio.s3.enabled"></a>spark.rapids.perfio.s3.enabled|Controls the AWS S3 reader for improved performance in certain queries. When true, enables it and throws at startup if no compatible HTTP client is on the classpath. When false, disables it unconditionally. When unset (default), enables it opportunistically if a compatible HTTP client is found, otherwise falls back to S3A with a warning. The presence of AWS SDK packages for Netty and/or CRT HTTP clients on the classpath is required. You can use Spark submit option `--packages software.amazon.awssdk:s3:2.22.12,software.amazon.awssdk:aws-crt-client:2.22.12` to achieve this. See https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/crt-based-s3-client.html#crt-based-s3-client-depend|None|Startup
 <a name="python.concurrentPythonWorkers"></a>spark.rapids.python.concurrentPythonWorkers|Set the number of Python worker processes that can execute concurrently per GPU. Python worker processes may temporarily block when the number of concurrent Python worker processes started by the same executor exceeds this amount. Allowing too many concurrent tasks on the same GPU may lead to GPU out of memory errors. >0 means enabled, while <=0 means unlimited|0|Runtime
 <a name="python.memory.gpu.allocFraction"></a>spark.rapids.python.memory.gpu.allocFraction|The fraction of total GPU memory that should be initially allocated for pooled memory for all the Python workers. It supposes to be less than (1 - $(spark.rapids.memory.gpu.allocFraction)), since the executor will share the GPU with its owning Python workers. Half of the rest will be used if not specified|None|Runtime
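
The three-way behaviour documented for `spark.rapids.perfio.gcs.enabled` (true, false, unset) can be sketched roughly as follows. This is an illustration only, not the plugin's code: `GcsPerfToggle`, `Decision`, `resolve`, and `classPresent` are hypothetical names, and probing `com.google.cloud.storage.Storage` is an assumption about which class indicates that google-cloud-storage is on the classpath.

```java
import java.util.Optional;

public class GcsPerfToggle {
    /** Hypothetical outcome of resolving the setting against the classpath. */
    enum Decision { ENABLED, DISABLED, FALLBACK_WITH_WARNING }

    /** Returns true if the named class can be located without initializing it. */
    static boolean classPresent(String className) {
        try {
            Class.forName(className, false, GcsPerfToggle.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /**
     * Mirrors the documented semantics:
     *   true  -> enable, fail fast if the client classes are missing
     *   false -> disable unconditionally
     *   unset -> enable if the client classes are found, else fall back with a warning
     */
    static Decision resolve(Optional<Boolean> setting, boolean clientAvailable) {
        if (setting.isPresent()) {
            if (!setting.get()) {
                return Decision.DISABLED;
            }
            if (!clientAvailable) {
                throw new IllegalStateException(
                    "GCS reader was requested but google-cloud-storage classes were not found");
            }
            return Decision.ENABLED;
        }
        return clientAvailable ? Decision.ENABLED : Decision.FALLBACK_WITH_WARNING;
    }

    public static void main(String[] args) {
        // Assumption: this class name stands in for "google-cloud-storage is present".
        boolean available = classPresent("com.google.cloud.storage.Storage");
        System.out.println("unset -> " + resolve(Optional.empty(), available));
        System.out.println("false -> " + resolve(Optional.of(false), available));
    }
}
```

Note that `isGCSPerfEnabled()` in this commit reads the key with a default of `false`, so any opportunistic handling of the unset case would have to happen elsewhere, presumably at startup.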
diff --git a/sql-plugin/src/main/java/com/nvidia/spark/rapids/fileio/RapidsInputFiles.java b/sql-plugin/src/main/java/com/nvidia/spark/rapids/fileio/RapidsInputFiles.java
@@ -38,4 +38,17 @@ public static boolean isS3PerfEnabled() {
         }
         return env.conf().getBoolean(PerfIOConf.S3PERF_ENABLED().key(), false);
     }
+    /**
+     * True iff {@code spark.rapids.perfio.gcs.enabled} is set to {@code true} on
+     * the active SparkConf. Returns false when no {@link SparkEnv} is initialized
+     * (e.g. before driver bring-up) so callers default to the non-PerfIO path.
+     */
+    public static boolean isGCSPerfEnabled() {
+        SparkEnv env = SparkEnv.get();
+        if (env == null) {
+            return false;
+        }
+        return env.conf().getBoolean(PerfIOConf.GCSPERF_ENABLED().key(), false);
+    }
+
 }
diff --git a/sql-plugin/src/main/java/com/nvidia/spark/rapids/fileio/hadoop/HadoopFileIO.java b/sql-plugin/src/main/java/com/nvidia/spark/rapids/fileio/hadoop/HadoopFileIO.java
@@ -51,6 +51,10 @@ public RapidsInputFile newInputFile(Path path) throws IOException {
         if (scheme != null && scheme.startsWith("s3") && RapidsInputFiles.isS3PerfEnabled()) {
             return S3InputFile.create(path, hadoopConf.value());
         }
+        if (scheme != null && (scheme.equals("gs") || scheme.equals("gcs")) &&
+                RapidsInputFiles.isGCSPerfEnabled()) {
+            return GCSInputFile.create(path, hadoopConf.value());
+        }
         return HadoopInputFile.create(path, hadoopConf.value());
     }
 
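
The scheme check added to `newInputFile` can be exercised in isolation with a small stand-in. The sketch below copies only the branching conditions from the hunk above; `ReaderDispatch`, `Reader`, and `choose` are hypothetical names, and the two boolean parameters stand in for `isS3PerfEnabled()` and `isGCSPerfEnabled()`.

```java
import java.net.URI;

public class ReaderDispatch {
    /** Hypothetical labels for the three input-file implementations. */
    enum Reader { S3_PERF, GCS_PERF, HADOOP }

    /** Same ordering as the diff: S3 first, then GCS, then the Hadoop default. */
    static Reader choose(URI uri, boolean s3PerfEnabled, boolean gcsPerfEnabled) {
        String scheme = uri.getScheme();
        if (scheme != null && scheme.startsWith("s3") && s3PerfEnabled) {
            return Reader.S3_PERF;
        }
        if (scheme != null && (scheme.equals("gs") || scheme.equals("gcs")) && gcsPerfEnabled) {
            return Reader.GCS_PERF;
        }
        return Reader.HADOOP;
    }

    public static void main(String[] args) {
        System.out.println(choose(URI.create("gs://bucket/data/part-0.parquet"), false, true));
        System.out.println(choose(URI.create("gs://bucket/data/part-0.parquet"), false, false));
        System.out.println(choose(URI.create("/tmp/local.parquet"), true, true));
    }
}
```

A path with no scheme, or a `gs` path with the flag off, falls through to the Hadoop reader, which is the fallback the diff preserves.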
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/fileio/hadoop/GCSInputFile.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/fileio/hadoop/GCSInputFile.scala
@@ -0,0 +1,90 @@
+/*
+ * Copyright (c) 2026, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.nvidia.spark.rapids.fileio.hadoop
+
+import java.io.IOException
+import java.net.URI
+import java.util.OptionalLong
+
+import scala.collection.JavaConverters._
+
+import ai.rapids.cudf.HostMemoryBuffer
+import com.nvidia.spark.rapids.{IntRangeWithOffset, PerfIO, RangeWithOffset, SuffixRangeWithOffset}
+import com.nvidia.spark.rapids.jni.fileio.{RapidsInputFile, SeekableInputStream}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+
+/**
+ * GCS-backed {@link RapidsInputFile} for Hadoop-conf-driven (non-iceberg) reads.
+ * {@code readVectored} issues batched byte-range reads through the optimized
+ * PerfIO path; the other operations delegate to the standard {@link HadoopInputFile}.
+ */
+class GCSInputFile private (
+    delegate: HadoopInputFile,
+    fileUri: URI,
+    hadoopConf: Configuration)
+  extends RapidsInputFile {
+
+  override def path(): String = delegate.path()
+
+  @throws[IOException]
+  override def getLength(): Long = delegate.getLength()
+
+  @throws[IOException]
+  override def getLastModificationTime(): OptionalLong = delegate.getLastModificationTime()
+
+  @throws[IOException]
+  override def open(): SeekableInputStream = delegate.open()
+
+  @throws[IOException]
+  override def readVectored(
+      output: HostMemoryBuffer,
+      copyRanges: java.util.List[RapidsInputFile.CopyRange]): Unit = {
+    val ranges = copyRanges.asScala.map { r =>
+      IntRangeWithOffset(r.getInputOffset, r.getLength, r.getOutputOffset)
+    }.toSeq
+    require(
+      PerfIO.readToHostMemory(hadoopConf, output, fileUri, ranges).isDefined,
+      "expected to use PerfIO to read")
+  }
+
+  /**
+   * Issue a single suffix-range read for the last {@code length} bytes. Avoids
+   * the {@code getLength()} round-trip the default {@link RapidsInputFile#readTail}
+   * would make. PerfIO resolves the GCS suffix range internally.
+   */
+  @throws[IOException]
+  override def readTail(length: Long, output: HostMemoryBuffer): Unit = {
+    if (length == 0) {
+      return
+    }
+    if (length < 0) {
+      throw new IllegalArgumentException("length must be non-negative")
+    }
+    val ranges = Seq[RangeWithOffset](SuffixRangeWithOffset(length, /*destOffset*/ 0L))
+    require(
+      PerfIO.readToHostMemory(hadoopConf, output, fileUri, ranges).isDefined,
+      "expected to use PerfIO to read")
+  }
+}
+
+object GCSInputFile {
+  @throws[IOException]
+  def create(filePath: Path, conf: Configuration): GCSInputFile = {
+    new GCSInputFile(HadoopInputFile.create(filePath, conf), filePath.toUri, conf)
+  }
+}
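
The `readTail` comment says PerfIO resolves the suffix range internally, but the diff does not show how. As a rough illustration of what a suffix range means, the sketch below computes the absolute byte window for "the last N bytes" and the equivalent HTTP `Range` header value in suffix form. `SuffixRange`, `resolve`, and `httpRangeHeader` are hypothetical names, and whether PerfIO issues an HTTP suffix range at all is an assumption.

```java
public class SuffixRange {
    /**
     * Absolute [start, endExclusive) window covering the last `length` bytes
     * of a file, clamped to the start of the file when `length` exceeds it.
     */
    static long[] resolve(long fileLength, long length) {
        if (fileLength < 0 || length < 0) {
            throw new IllegalArgumentException("fileLength and length must be non-negative");
        }
        long start = Math.max(0L, fileLength - length);
        return new long[] { start, fileLength };
    }

    /**
     * HTTP Range header value in suffix form. A server that honours it returns
     * the last `length` bytes without the client knowing the object size,
     * which is what lets a caller skip a separate length lookup.
     */
    static String httpRangeHeader(long length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        return "bytes=-" + length;
    }

    public static void main(String[] args) {
        long[] window = resolve(1000L, 8L);
        System.out.println(window[0] + ".." + window[1]);
        System.out.println(httpRangeHeader(8L));
    }
}
```

The first helper needs the file length up front; the second does not, which matches the stated motivation for overriding `readTail`.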
