scala.collection.convert.Wrappers$JListWrapper is not a valid external type for schema of string #2322

Open
ashishgupta2014 opened this issue Jan 13, 2025 · 4 comments

Comments

@ashishgupta2014

I am running into an issue while reading data from Elasticsearch.
Steps to reproduce:
```python
import json

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

SPARK_CONF = {
    "spark.jars.packages": "org.elasticsearch:elasticsearch-spark-30_2.12:8.15.1"
}

query = {
    'query': {'match_all': {}}
}

schema = StructType([
    #------
    StructField("tags", ArrayType(StructType([
        StructField("id", StringType(), True),
        StructField("owningOrganizationUnit", StringType(), True),
        StructField("parentId", StringType(), True),
        StructField("tagId", StringType(), True),
        StructField("tagName", StringType(), True),
        StructField("tenantId", StringType(), True),
    ]), True), True),
    StructField("securityContext", StructType([
        StructField("tenantId", StringType(), True),
        StructField("owningOrganizationUnit", ArrayType(StringType(), True), True)
    ]), True)
])

# Connection details below are redacted in the report.
es_index_conf = {
    'es.net.ssl': 'true',
    'es.nodes.wan.only': 'true',
    'es.read.field.as.array.include': 'securityContext.owningOrganizationUnit',
    'es.resource': ****,
    'es.nodes': -----,
    'es.port': ------,
    'es.net.http.auth.user': -------,
    'es.net.http.auth.pass': ---------,
    'es.query': json.dumps(query)
}

conf = SparkConf()
# In the original application this is self.app_config.SPARK_CONF.
for k, v in SPARK_CONF.items():
    conf.set(k, v)
spark = SparkSession.builder.config(conf=conf).appName("Sample").getOrCreate()

# Parentheses allow the chained calls to span several lines.
df = (
    spark.read
    .format("org.elasticsearch.spark.sql")
    .options(**es_index_conf)
    .schema(schema)
    .load()
)
```

I am getting the error below, with stack trace:

```
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/01/13 16:23:10 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkRuntimeException: Error while encoding: java.lang.RuntimeException: scala.collection.convert.Wrappers$JListWrapper is not a valid external type for schema of string

Caused by: java.lang.RuntimeException: scala.collection.convert.Wrappers$JListWrapper is not a valid external type for schema of string
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ValidateExternalType_58$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_65$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_2_11$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:207)
... 20 more
```
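The exception text says the row encoder received a list (`JListWrapper`) for a column that the schema declares as a string, so it may help to find out which fields in a sample document actually hold arrays. Below is a minimal sketch of that check, assuming a sample document is available as a Python dict. The helper `find_list_paths` and the `sample` document are hypothetical and are not taken from the reporter's data.

```python
def find_list_paths(doc):
    """Return the sorted dotted paths in `doc` whose value is a JSON array."""
    found = set()

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            if path:
                found.add(path)
            for item in value:
                # Array elements share the path of the array itself.
                walk(item, path)

    walk(doc, "")
    return sorted(found)


# Hypothetical document: tags.owningOrganizationUnit holds an array here.
sample = {
    "tags": [{"id": "t1", "owningOrganizationUnit": ["ou1", "ou2"]}],
    "securityContext": {"tenantId": "x", "owningOrganizationUnit": ["ou1"]},
}

list_paths = find_list_paths(sample)
print(list_paths)
```

Any path reported here that the Spark schema declares as `StringType` would be a candidate for the mismatch. Whether that is the cause in this report is not confirmed in the thread.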

@masseyke
Member

Do you get this when there is no data in your index? If it is data-dependent, can you include the data needed to reproduce it?

@ashishgupta2014
Author

Hi @masseyke,
I am sharing sample data. Please take a look.
data.json

@masseyke
Member

Thanks. Can you also provide mappings for your index? Ideally, could you provide all of the python commands (including creating mappings and indices, and loading data) to reproduce it? If it helps, I use this docker image for this type of thing: https://github.com/masseyke/es-spark-docker.
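For the data-loading part of such a reproduction, one option is the Elasticsearch `_bulk` endpoint, which takes newline-delimited JSON: an action line followed by the document source. The sketch below only builds that payload with the standard library. The function name `build_bulk_payload`, the index name `repro-index`, and the documents are hypothetical placeholders, and sending the payload (with `Content-Type: application/x-ndjson`) is left to whatever HTTP client is at hand.

```python
import json


def build_bulk_payload(index_name, docs):
    """Build an NDJSON body for the Elasticsearch _bulk endpoint."""
    lines = []
    for doc in docs:
        # Action line: index the following document into `index_name`.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical documents, one with a scalar field and one with an array.
docs = [
    {"tags": [{"id": "t1", "owningOrganizationUnit": "ou1"}]},
    {"tags": [{"id": "t2", "owningOrganizationUnit": ["ou1", "ou2"]}]},
]

payload = build_bulk_payload("repro-index", docs)
print(payload)
```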

@ashishgupta2014
Author

es_mapping.json

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, ArrayType, BooleanType, TimestampType
)

# Define the schema

schema = StructType([
    # StructField("profile", StructType([
    #     StructField("id", StringType(), True),
    #     StructField("prefix", StringType(), True),
    #     StructField("title", StringType(), True),
    #     StructField("firstName", StringType(), True),
    #     StructField("lastName", StringType(), True),
    #     StructField("email", StringType(), True),
    #     StructField("primaryPhoneNumber", StringType(), True),
    #     StructField("secondaryPhoneNumber", StringType(), True),
    #     StructField("degree", StringType(), True),
    #     StructField("isAlumni", StringType(), True),
    #     StructField("isActive", BooleanType(), True),
    #     StructField("graduationDate", StringType(), True),
    #     StructField("workingHours", StringType(), True),
    #     StructField("facultyType", StringType(), True),
    #     StructField("resumeFileCollectionKey", StringType(), True),
    #     StructField("isPreceptor", BooleanType(), True),
    #     StructField("slotRequest", BooleanType(), True),
    #     StructField("emailNotification", BooleanType(), True),
    #     StructField("designation", ArrayType(StringType(), True), True),
    #     StructField("program", StringType(), True)
    # ]), True),
    StructField("tags", ArrayType(StructType([
        StructField("id", StringType(), True),
        StructField("owningOrganizationUnit", StringType(), True),
        StructField("parentId", StringType(), True),
        StructField("tagId", StringType(), True),
        StructField("tagName", StringType(), True),
        StructField("tenantId", StringType(), True),
    ]), True), True),
    StructField("category", ArrayType(StructType([
        StructField("categoryId", StringType(), True),
        StructField("categoryName", StringType(), True),
        StructField("id", StringType(), True),
        StructField("owningOrganizationUnit", StringType(), True),
        StructField("parentId", StringType(), True),
        StructField("tenantId", StringType(), True),
    ]), True), True),
    StructField("demographic", StructType([
        StructField("nationalProviderIdentifier", StringType(), True),
        StructField("credentials", StringType(), True),
        StructField("practiceSettings", ArrayType(StringType(), True), True)
    ]), True),
    StructField("mailingAddress", StructType([
        StructField("addressLine1", StringType(), True),
        StructField("addressLine2", StringType(), True),
        StructField("city", StringType(), True),
        StructField("state", StringType(), True),
        StructField("zipCode", StringType(), True)
    ]), True),
    StructField("assistant", ArrayType(StructType([
        StructField("id", StringType(), True),
        StructField("name", StringType(), True),
        StructField("parentId", StringType(), True)
    ]), True), True),
    StructField("licences", ArrayType(StructType([
        StructField("expiryDate", TimestampType(), True),
        StructField("isOpenEnded", StringType(), True),
        StructField("issueDate", TimestampType(), True),
        StructField("issuedBy", StringType(), True),
        StructField("licenseNumber", StringType(), True),
        StructField("licenseType", StringType(), True),
        StructField("licensureId", StringType(), True),
        StructField("notes", StringType(), True),
        StructField("parentId", StringType(), True),
    ]), True), True),
    StructField("certifications", ArrayType(StructType([
        StructField("areaOfPractice", StringType(), True),
        StructField("certId", StringType(), True),
        StructField("certificationNumber", StringType(), True),
        StructField("dateOfCertification", TimestampType(), True),
        StructField("expirationDate", TimestampType(), True),
        StructField("isOpenEnded", StringType(), True),
        StructField("name", StringType(), True),
        StructField("notes", StringType(), True),
        StructField("parentId", StringType(), True),
    ]), True), True),
    # StructField("associatedSites", ArrayType(StructType([
    #     StructField("siteId", StringType(), True),
    #     StructField("siteName", StringType(), True),
    #     StructField("siteOrgId", StringType(), True),
    #     StructField("siteIsActive", BooleanType(), True),
    #     StructField("isPrimary", BooleanType(), True),
    #     StructField("organizationUnitName", StringType(), True)
    # ]), True), True),
    StructField("associatedLocations", ArrayType(StructType([
        StructField("locationName", StringType(), True),
        StructField("locationId", StringType(), True),
        StructField("siteId", StringType(), True),
        StructField("siteName", StringType(), True),
        StructField("locationIsActive", StringType(), True),
        StructField("siteOrgId", StringType(), True),
        StructField("siteIsActive", StringType(), True)
    ]), True), True),
    StructField("sharedNotes", ArrayType(StructType([
        StructField("id", StringType(), True),
        StructField("note", StringType(), True),
        StructField("parentId", StringType(), True),
    ]), True), True),
    StructField("internalNotes", ArrayType(StructType([
        StructField("id", StringType(), True),
        StructField("note", StringType(), True),
        StructField("parentId", StringType(), True),
    ]), True), True),
    StructField("documents", ArrayType(StructType([
        StructField("documentId", StringType(), True),
        StructField("fileCollectionKey", StringType(), True),
        StructField("name", StringType(), True),
        StructField("parentId", StringType(), True),
    ]), True), True),
    StructField("securityContext", StructType([
        StructField("tenantId", StringType(), True),
        StructField("owningOrganizationUnit", ArrayType(StringType(), True), True)
    ]), True),
    StructField("esLoadedDate", TimestampType(), True)
])
```
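This schema declares many array fields, while the reported `es.read.field.as.array.include` setting names only `securityContext.owningOrganizationUnit`, so it may be worth listing every array path the schema declares and comparing that list with the connector setting. The sketch below does this over the dict form of a schema, which PySpark returns from `schema.jsonValue()`. The helper `array_field_paths` is hypothetical, and the hand-written `schema_json` covers only two of the fields above so that the example runs without Spark. Whether every array field has to be listed in the option is an assumption the thread does not confirm.

```python
def array_field_paths(schema_json):
    """Return dotted paths of all array-typed fields in a Spark schema dict."""
    paths = []

    def walk(data_type, path):
        if not isinstance(data_type, dict):
            return  # primitive type name such as "string"
        if data_type.get("type") == "struct":
            for field in data_type["fields"]:
                name = field["name"]
                walk(field["type"], f"{path}.{name}" if path else name)
        elif data_type.get("type") == "array":
            if path not in paths:
                paths.append(path)
            walk(data_type["elementType"], path)

    walk(schema_json, "")
    return paths


def string_field(name):
    # Dict form of StructField(name, StringType(), True).
    return {"name": name, "type": "string", "nullable": True, "metadata": {}}


# Hand-written subset of the schema, in the shape produced by jsonValue().
schema_json = {
    "type": "struct",
    "fields": [
        {
            "name": "tags",
            "type": {
                "type": "array",
                "containsNull": True,
                "elementType": {
                    "type": "struct",
                    "fields": [
                        string_field("id"),
                        string_field("owningOrganizationUnit"),
                    ],
                },
            },
            "nullable": True,
            "metadata": {},
        },
        {
            "name": "securityContext",
            "type": {
                "type": "struct",
                "fields": [
                    string_field("tenantId"),
                    {
                        "name": "owningOrganizationUnit",
                        "type": {
                            "type": "array",
                            "elementType": "string",
                            "containsNull": True,
                        },
                        "nullable": True,
                        "metadata": {},
                    },
                ],
            },
            "nullable": True,
            "metadata": {},
        },
    ],
}

declared = array_field_paths(schema_json)
# Comma-separated form, as accepted by es.read.field.as.array.include.
es_option = ",".join(declared)
print(es_option)
```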
