Hi, thanks for the content, I find it very useful and it was worth a couple of coffees.
Currently I am facing quite a few difficulties with schema definition when reading from a web API. It is way too strict when reading and fails on things such as a simple 0 in a column I defined as DoubleType, or even with inferSchema (can't merge Long with Double error). What I ended up doing is defining everything in the initial schema as StringType and casting the date and numerical columns later (rough sketch below). Surely there must be a better way, have you faced this issue in your projects?
I know this isn't tech support so no issue if you don't reply, still appreciate the content :)
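This is roughly the workaround I have now, assuming the usual spark session from the notebook; the column names and sample payload are made up:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Pretend rows from the API: (order_date, amount) as strings (placeholder data).
api_rows = [("2024-01-31", "0"), ("2024-02-01", "19.90")]

# Read everything as strings first so no value can break the schema.
raw_schema = StructType([
    StructField("order_date", StringType(), True),
    StructField("amount", StringType(), True),
])
df = spark.createDataFrame(api_rows, schema=raw_schema)

# Cast the date and numeric columns afterwards.
typed_df = (
    df.withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
      .withColumn("amount", F.col("amount").cast("double"))
)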
I have solved a similar issue in the past using exactly the same method you are using right now. With that method you make sure that you won't lose any data.
Another method that came to my mind is to use permissive mode when reading the data.
In my opinion, it is not as good, and you end up with the bad data in the _corrupt_record column.
Permissive works like this:
df = spark.read.option("mode", "PERMISSIVE").schema(schema).json(path)  # schema and path defined elsewhere
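If you want to see which rows went bad, something like this should work (just a sketch with placeholder column names and file path; note that the schema has to list the _corrupt_record field explicitly, otherwise the raw malformed records are not kept):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Placeholder schema: one real column plus the corrupt record column.
schema = StructType([
    StructField("amount", DoubleType(), True),
    StructField("_corrupt_record", StringType(), True),
])

df = (spark.read
      .option("mode", "PERMISSIVE")
      .schema(schema)
      .json("Files/api_response.json"))  # placeholder path

# Cache first: Spark does not allow queries that reference only the
# corrupt record column directly on top of the raw files.
df.cache()
bad_rows = df.filter(df["_corrupt_record"].isNotNull())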
I always try to answer every comment and help out if I can. 😊
However, sometimes it can take days for me to reply if I am busy and the problem is complex.
Some of the issues are just way too complex for me to solve in a reasonable amount of time or without seeing the issue with my own eyes.
Thanks a lot for the coffees!
You are the first one to send me those. ☕😊
@AleksiPartanenTech well deserved! Enjoy and thanks for the content, I'll be around for more since I'm building a lot of stuff right now
Is there any way to copy data to warehouse tables from lakehouse files instead of ADLS or S3?
You can do that easily with a data pipeline tool. However, if you want to use a PySpark notebook, it is much more challenging, and at the moment you would probably have to use ODBC or something similar to write the data there.
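Just as a rough sketch of the ODBC route (not tested against a real warehouse; the driver name, connection details, table and column names are all placeholders, and collecting to the driver like this only makes sense for small amounts of data):

import pyodbc

# Placeholder connection details; point these at your warehouse SQL endpoint
# and use whatever authentication method fits your setup.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<warehouse-sql-endpoint>;"
    "Database=<warehouse-name>;"
    "Authentication=ActiveDirectoryInteractive;"
)
cursor = conn.cursor()

# Read the lakehouse file with Spark and push the rows over ODBC.
rows = spark.read.parquet("Files/my_table").collect()  # placeholder path
cursor.executemany(
    "INSERT INTO dbo.MyTable (id, amount) VALUES (?, ?)",  # placeholder table/columns
    [(r["id"], r["amount"]) for r in rows],
)
conn.commit()
conn.close()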