Class DataLayer

    • Field Summary

      Fields 
      Modifier and Type Field Description
      static long serialVersionUID  
    • Constructor Summary

      Constructors 
      Constructor Description
      DataLayer()  
    • Method Summary

      Methods 
      Modifier and Type Method Description
      org.apache.cassandra.bridge.BigNumberConfig bigNumberConfig​(org.apache.cassandra.spark.data.CqlField field)
      A DataLayer implementation can override this method to return the BigInteger/BigDecimal precision/scale values for a given column.
      abstract org.apache.cassandra.bridge.CassandraBridge bridge()  
      abstract org.apache.cassandra.spark.data.CqlTable cqlTable()  
      protected abstract java.util.concurrent.ExecutorService executorService()
      The DataLayer implementation should provide an ExecutorService for blocking I/O when opening SSTable readers.
      abstract boolean isInPartition​(int partitionId, java.math.BigInteger token, java.nio.ByteBuffer key)  
      abstract java.lang.String jobId()  
      org.apache.cassandra.spark.reader.StreamScanner openCompactionScanner​(int partitionId, java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFilters, org.apache.cassandra.spark.sparksql.filters.SSTableTimeRangeFilter sstableTimeRangeFilter)  
      org.apache.cassandra.spark.reader.StreamScanner<org.apache.cassandra.spark.reader.RowData> openCompactionScanner​(int partitionId, java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFilters, org.apache.cassandra.spark.sparksql.filters.SSTableTimeRangeFilter sstableTimeRangeFilter, org.apache.cassandra.spark.sparksql.filters.PruneColumnFilter columnFilter)  
      org.apache.cassandra.spark.reader.StreamScanner<org.apache.cassandra.spark.reader.IndexEntry> openPartitionSizeIterator​(int partitionId)  
      abstract int partitionCount()  
      abstract org.apache.cassandra.spark.data.partitioner.Partitioner partitioner()  
      java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFiltersInRange​(int partitionId, java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFilters)  
      org.apache.spark.sql.types.StructType partitionSizeStructType()  
      boolean readIndexOffset()
      When true, the SSTableReader should attempt to find the offset into the Data.db file for the Spark worker's token range.
      java.util.List<org.apache.cassandra.spark.config.SchemaFeature> requestedFeatures()  
      org.apache.cassandra.spark.sparksql.filters.SparkRangeFilter sparkRangeFilter​(int partitionId)
      The DataLayer implementation should provide a SparkRangeFilter to filter out partitions and mutations that do not overlap the Spark worker's token range.
      abstract org.apache.cassandra.spark.data.SSTablesSupplier sstables​(int partitionId, org.apache.cassandra.spark.sparksql.filters.SparkRangeFilter sparkRangeFilter, java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFilters)  
      org.apache.cassandra.spark.sparksql.filters.SSTableTimeRangeFilter sstableTimeRangeFilter()
      Returns an SSTableTimeRangeFilter to filter out SSTables based on their min and max timestamps.
      org.apache.cassandra.analytics.stats.Stats stats()
      Override to plug in your own Stats instrumentation for recording internal events
      org.apache.spark.sql.types.StructType structType()
      Map Cassandra CQL table schema to SparkSQL StructType
      abstract org.apache.cassandra.spark.utils.TimeProvider timeProvider()  
      org.apache.cassandra.spark.data.converter.SparkSqlTypeConverter typeConverter()  
      org.apache.spark.sql.sources.Filter[] unsupportedPushDownFilters​(org.apache.spark.sql.sources.Filter[] filters)  
      boolean useIncrementalRepair()
      When true, the SSTableReader should only read repaired SSTables from a single 'primary repair' replica, and read unrepaired SSTables at the user-set consistency level.
      org.apache.cassandra.bridge.CassandraVersion version()  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • DataLayer

        public DataLayer()
    • Method Detail

      • partitionSizeStructType

        public org.apache.spark.sql.types.StructType partitionSizeStructType()
        Returns:
        SparkSQL table schema expected for reading Partition sizes with PartitionSizeTableProvider.
      • structType

        public org.apache.spark.sql.types.StructType structType()
        Map Cassandra CQL table schema to SparkSQL StructType
        Returns:
        StructType representation of CQL table
      • requestedFeatures

        public java.util.List<org.apache.cassandra.spark.config.SchemaFeature> requestedFeatures()
      • bigNumberConfig

        public org.apache.cassandra.bridge.BigNumberConfig bigNumberConfig​(org.apache.cassandra.spark.data.CqlField field)
        A DataLayer implementation can override this method to return the BigInteger/BigDecimal precision/scale values for a given column.
        Parameters:
        field - the CQL field
        Returns:
        a BigNumberConfig object that specifies the desired precision/scale for BigDecimal and BigInteger
      • version

        public org.apache.cassandra.bridge.CassandraVersion version()
        Returns:
        Cassandra version (3.0, 4.0, etc.)
      • bridge

        public abstract org.apache.cassandra.bridge.CassandraBridge bridge()
        Returns:
        version-specific CassandraBridge wrapping shaded packages
      • typeConverter

        public org.apache.cassandra.spark.data.converter.SparkSqlTypeConverter typeConverter()
        Returns:
        SparkSQL type converter that maps version-specific Cassandra types to SparkSQL types
      • partitionCount

        public abstract int partitionCount()
      • cqlTable

        public abstract org.apache.cassandra.spark.data.CqlTable cqlTable()
        Returns:
        CqlTable object for the table being read; batch/bulk read jobs only
      • isInPartition

        public abstract boolean isInPartition​(int partitionId,
                                              java.math.BigInteger token,
                                              java.nio.ByteBuffer key)
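        Conceptually, this is a token-containment check: given a Spark partition id, decide whether a row's token falls inside that worker's share of the ring. A minimal, self-contained sketch of the idea, assuming an even split of the Murmur3 token ring (the class and the explicit numPartitions parameter are illustrative, not part of the DataLayer API):

```java
import java.math.BigInteger;

public class TokenRangeCheck {
    // Murmur3Partitioner token bounds: [-2^63, 2^63 - 1]
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    /**
     * True if token falls in the sub-range assigned to partitionId when the
     * full ring is split evenly across numPartitions Spark tasks.
     */
    public static boolean isInPartition(int partitionId, int numPartitions, BigInteger token) {
        BigInteger span = MAX.subtract(MIN).add(BigInteger.ONE)
                             .divide(BigInteger.valueOf(numPartitions));
        BigInteger lower = MIN.add(span.multiply(BigInteger.valueOf(partitionId)));
        // The last partition absorbs any remainder left by the integer division
        BigInteger upper = (partitionId == numPartitions - 1)
                         ? MAX
                         : lower.add(span).subtract(BigInteger.ONE);
        return token.compareTo(lower) >= 0 && token.compareTo(upper) <= 0;
    }
}
```

        A real implementation would instead consult the token ranges produced by its partitioner() and whatever replica-aware assignment it uses.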
      • timeProvider

        public abstract org.apache.cassandra.spark.utils.TimeProvider timeProvider()
        Returns:
        a TimeProvider
      • partitionKeyFiltersInRange

        public java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFiltersInRange​(int partitionId,
                                                                                                                         java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFilters)
                                                                                                                  throws org.apache.cassandra.spark.sparksql.NoMatchFoundException
        Throws:
        org.apache.cassandra.spark.sparksql.NoMatchFoundException
      • sparkRangeFilter

        public org.apache.cassandra.spark.sparksql.filters.SparkRangeFilter sparkRangeFilter​(int partitionId)
        The DataLayer implementation should provide a SparkRangeFilter to filter out partitions and mutations that do not overlap the Spark worker's token range.
        Parameters:
        partitionId - the partitionId for the task
        Returns:
        SparkRangeFilter for the Spark worker's token range
      • sstableTimeRangeFilter

        @NotNull
        public org.apache.cassandra.spark.sparksql.filters.SSTableTimeRangeFilter sstableTimeRangeFilter()
        Returns an SSTableTimeRangeFilter to filter out SSTables based on their min and max timestamps.
        Returns:
        SSTableTimeRangeFilter
      • executorService

        protected abstract java.util.concurrent.ExecutorService executorService()
        The DataLayer implementation should provide an ExecutorService for blocking I/O when opening SSTable readers. It is the responsibility of the DataLayer implementation to appropriately size and manage this ExecutorService.
        Returns:
        executor service
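        A sketch of what such a pool might look like, assuming a plain fixed-size pool of daemon threads (the pool size, thread naming, and class name are illustrative choices, not mandated by the contract):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class SSTableIoPool {
    // Blocking I/O benefits from more threads than cores; 64 is an
    // illustrative cap, not a value mandated by the DataLayer contract.
    private static final int POOL_SIZE = 64;

    public static ExecutorService create() {
        ThreadFactory factory = new ThreadFactory() {
            private final AtomicInteger count = new AtomicInteger();

            @Override
            public Thread newThread(Runnable task) {
                Thread t = new Thread(task, "sstable-io-" + count.incrementAndGet());
                t.setDaemon(true); // do not block JVM shutdown
                return t;
            }
        };
        return Executors.newFixedThreadPool(POOL_SIZE, factory);
    }
}
```

        Daemon threads keep a forgotten pool from pinning the JVM open; the implementation still owns shutting the pool down when the job ends.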
      • sstables

        public abstract org.apache.cassandra.spark.data.SSTablesSupplier sstables​(int partitionId,
                                                                                  @Nullable
                                                                                  org.apache.cassandra.spark.sparksql.filters.SparkRangeFilter sparkRangeFilter,
                                                                                  @NotNull
                                                                                  java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFilters)
        Parameters:
        partitionId - the partitionId of the task
        sparkRangeFilter - spark range filter
        partitionKeyFilters - the list of partition key filters
        Returns:
        set of SSTables
      • partitioner

        public abstract org.apache.cassandra.spark.data.partitioner.Partitioner partitioner()
      • jobId

        public abstract java.lang.String jobId()
        Returns:
        a string that uniquely identifies this Spark job
      • openCompactionScanner

        public org.apache.cassandra.spark.reader.StreamScanner openCompactionScanner​(int partitionId,
                                                                                     java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFilters,
                                                                                     org.apache.cassandra.spark.sparksql.filters.SSTableTimeRangeFilter sstableTimeRangeFilter)
      • readIndexOffset

        public boolean readIndexOffset()
        When true, the SSTableReader should attempt to find the offset into the Data.db file for the Spark worker's token range. It does this by binary searching the Summary.db file to find an offset into the Index.db file, then reading Index.db from that offset to find the first position in the Data.db file that overlaps with the Spark worker's token range. This lets the reader start at the first in-range partition in the Data.db file and stop after reading the last, avoiding wasteful reads of out-of-range partitions. The feature improves scalability as more Spark workers shard the token range into smaller subranges.
        Returns:
        true if the SSTableReader should read the Summary.db and Index.db files to find the start offset into the Data.db file that overlaps with the Spark worker's token range
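        The lookup described above can be illustrated with a simplified, in-memory stand-in for the summary index: binary search the first token of each index block, then start reading Data.db at the offset of the last block that begins at or before the range start. The arrays and method here are hypothetical; real Summary.db entries point into Index.db, which is then scanned for the exact Data.db position.

```java
import java.util.Arrays;

public class SummarySearch {
    /**
     * Given parallel arrays of (first token per index block, Data.db offset
     * of that block), return the Data.db offset to begin reading from so
     * that every partition with token >= rangeStart is covered.
     */
    public static long startOffset(long[] blockFirstTokens, long[] dataOffsets, long rangeStart) {
        int idx = Arrays.binarySearch(blockFirstTokens, rangeStart);
        if (idx < 0) {
            // Not an exact hit: take the last block starting before rangeStart
            idx = -idx - 2;
        }
        return dataOffsets[Math.max(idx, 0)];
    }
}
```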
      • useIncrementalRepair

        public boolean useIncrementalRepair()
        When true, the SSTableReader should only read repaired SSTables from a single 'primary repair' replica, and read unrepaired SSTables at the user-set consistency level.
        Returns:
        true if the SSTableReader should only read repaired SSTables on a single 'primary repair' replica
      • openCompactionScanner

        public org.apache.cassandra.spark.reader.StreamScanner<org.apache.cassandra.spark.reader.RowData> openCompactionScanner​(int partitionId,
                                                                                                                                java.util.List<org.apache.cassandra.spark.sparksql.filters.PartitionKeyFilter> partitionKeyFilters,
                                                                                                                                org.apache.cassandra.spark.sparksql.filters.SSTableTimeRangeFilter sstableTimeRangeFilter,
                                                                                                                                @Nullable
                                                                                                                                org.apache.cassandra.spark.sparksql.filters.PruneColumnFilter columnFilter)
        Returns:
        CompactionScanner for iterating over one or more SSTables, compacting data and purging tombstones
      • openPartitionSizeIterator

        public org.apache.cassandra.spark.reader.StreamScanner<org.apache.cassandra.spark.reader.IndexEntry> openPartitionSizeIterator​(int partitionId)
        Parameters:
        partitionId - Spark partition id
        Returns:
        a PartitionSizeIterator that iterates over Index.db files to calculate partition size.
      • unsupportedPushDownFilters

        public org.apache.spark.sql.sources.Filter[] unsupportedPushDownFilters​(org.apache.spark.sql.sources.Filter[] filters)
        Parameters:
        filters - array of push-down filters offered by Spark
        Returns:
        the subset of push-down filters that are not supported by this data layer
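        The contract mirrors Spark's filter push-down handshake: the data layer keeps the filters it can evaluate itself and hands back the rest, which Spark then re-applies after the scan. A toy sketch with filter names standing in for org.apache.spark.sql.sources.Filter objects (the names and helper class are illustrative):

```java
import java.util.Arrays;
import java.util.Set;

public class PushDownFilters {
    /**
     * Of the filters Spark offers to push down, return those this data
     * layer cannot evaluate itself; Spark re-applies the returned ones.
     */
    public static String[] unsupported(String[] offered, Set<String> supported) {
        return Arrays.stream(offered)
                     .filter(f -> !supported.contains(f))
                     .toArray(String[]::new);
    }
}
```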
      • stats

        public org.apache.cassandra.analytics.stats.Stats stats()
        Override to plug in your own Stats instrumentation for recording internal events.
        Returns:
        Stats implementation to record internal events