
Schema discovery created unpersistable schema for empty arrays #114

Open
ssimeonov opened this issue Apr 9, 2016 · 4 comments

@ssimeonov

If all observed values of a document field are [], the generated schema is ArrayType[NullType], which cannot be persisted or used in any meaningful way.

In the absence of evidence about the type of the array elements, a more logical behavior would be to allow overrides of the schema for a subset of fields, e.g., as a JSON string (schema.json in Spark), or, if a default behavior is needed, to map to ArrayType[StringType] as opposed to ArrayType[NullType]. The benefit is that this mapping can be persisted and can represent any Mongo array, including heterogeneous ones.
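To make the suggestion concrete, here is a minimal sketch of the JSON-string override using plain Spark SQL types; the field names are illustrative and nothing below is part of the connector API:

import org.apache.spark.sql.types._

// What discovery produces today when every sampled "tags" value is []:
val inferred = StructType(Seq(
  StructField("id", StringType),
  StructField("tags", ArrayType(NullType)) // unpersistable element type
))

// The proposed override, expressed in Spark's JSON schema format,
// mapping the empty array to ArrayType[StringType] instead:
val overrideJson =
  """{"type":"struct","fields":[
    |  {"name":"id","type":"string","nullable":true,"metadata":{}},
    |  {"name":"tags","type":{"type":"array","elementType":"string","containsNull":true},"nullable":true,"metadata":{}}
    |]}""".stripMargin

val corrected = DataType.fromJson(overrideJson).asInstanceOf[StructType]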

@pfcoperez pfcoperez self-assigned this Apr 26, 2016
@pfcoperez
Contributor

pfcoperez commented Apr 26, 2016

This issue is somewhat like #113: you should not assume a type from a lack of evidence just because it is convenient.

With the current connector API you can already provide your own schema without any problem:

/**
 * A MongoDB baseRelation that can eliminate unneeded columns
 * and filter using selected predicates before producing
 * an RDD containing all matching tuples as Row objects.
 * @param config A Mongo configuration with needed properties for MongoDB
 * @param schemaProvided The optionally provided schema. If not provided,
 *                       it will be inferred from the whole field projection
 *                       of the specified table in Spark SQL statement using
 *                       a sample ratio (as JSON Data Source does).
 * @param sqlContext An existing Spark SQL context.
 */
class MongodbRelation(private val config: Config,
                       schemaProvided: Option[StructType] = None)(
                       @transient val sqlContext: SQLContext) extends BaseRelation

That class will only try to infer the schema if schemaProvided is None:

  /**
   * Default schema to be used in case no schema was provided before.
   * It scans the RDD generated by Spark SQL statement,
   * using specified sample ratio.
   */
  @transient private lazy val lazySchema =
    MongodbSchema(
      new MongodbRDD(sqlContext, config, rddPartitioner),
      config.get[Any](MongodbConfig.SamplingRatio).fold(MongodbConfig.DefaultSamplingRatio)(_.toString.toDouble)).schema()

  override val schema: StructType = schemaProvided.getOrElse(lazySchema)

On the other hand, there are ways of building a modified schema from the inferred one, which you can then use to construct a MongodbRelation.
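For instance, here is a minimal sketch of one such rewrite, assuming you want to replace every ArrayType(NullType) in the inferred schema with ArrayType(StringType); fixNullArrays is a hypothetical helper, not part of the connector:

import org.apache.spark.sql.types._

// Hypothetical helper: rewrite ArrayType(NullType) to ArrayType(StringType),
// recursing into nested documents so nested empty arrays are fixed too.
def fixNullArrays(schema: StructType): StructType =
  StructType(schema.fields.map {
    case StructField(name, ArrayType(NullType, containsNull), nullable, meta) =>
      StructField(name, ArrayType(StringType, containsNull), nullable, meta)
    case StructField(name, nested: StructType, nullable, meta) =>
      StructField(name, fixNullArrays(nested), nullable, meta)
    case other => other
  })

// The patched schema can then be supplied explicitly:
// val relation = new MongodbRelation(config, Some(fixNullArrays(inferredSchema)))(sqlContext)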

@ssimeonov
Author

ssimeonov commented Apr 26, 2016

That's good to know, but are you suggesting that the desired behavior of the library is to produce an unwritable schema in this case, as opposed to a potentially incorrect but writable one? If so, especially since the exception is rather cryptic, would it not be better to create some best practices and utilities around checking the schema after discovery and, perhaps, offer schema-rewriting strategies, e.g., removing the field from the schema? (Adding the field later works well with Spark's schema merging, but changing the type of a field does not.)
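As a sketch of the "remove the field" strategy mentioned above (a hypothetical utility, not connector API):

import org.apache.spark.sql.types._

// Hypothetical utility: drop top-level fields typed ArrayType(NullType)
// after discovery, so they can be re-introduced later with a concrete
// element type via Spark's schema merging.
def dropNullArrayFields(schema: StructType): StructType =
  StructType(schema.fields.filterNot {
    case StructField(_, ArrayType(NullType, _), _, _) => true
    case _ => false
  })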

@pfcoperez
Contributor

@ssimeonov The development team behind Spark-MongoDB has been talking about this issue. In spite of it not being formally correct, as I mentioned above, we've agreed to change it to StringType: a sweet spot between correctness and convenience.

@pfcoperez
Contributor

@ssimeonov A PR (#125) has been issued with the described change; it will be merged soon.

Btw, it must be quite a strange collection, or too low a sampling ratio, where not a single object has an array with at least one element.
