A few months back, our on-call engineers were having a hard time with random Spark job failures in our workflow. The only resolution was to retry the job, which eventually succeeded.
According to the Spark error logs, the cause was:
s3://bucketName/prefix/obj.json file not found exception.
It was an unnecessary bother, and the number of retries needed was unpredictable. Things got worse when it started happening for multiple jobs, and the same sort of issue began showing up in more systems.
On digging deeper, we found that the culprit was S3's eventual consistency model.
S3 data consistency model
Simply put, S3 provides read-after-write consistency for new objects created in a bucket.
For object deletion or modification it is eventually consistent, i.e. you will see the final state, but it might take some time. In most cases it is lightning fast; the issues mentioned here are seen once in a blue moon.
These rare scenarios, where even AWS S3 lagged, were most probably due to the large data size.
In case you are wondering, it was as large as ~2 TB being poured into S3 in a matter of seconds/minutes by Spark.
Once I even observed the data size shrinking in the AWS S3 console without any activity happening on the bucket; most probably S3 was syncing its metadata at that time.
To be honest that was fascinating :P
What exactly was affected
- Spark jobs that consumed S3 data which was being overwritten by producer jobs were the ones failing, with a 'file not found' exception.
This particular file was one that had been deleted earlier by the producer job. Spark somehow listed that S3 file, but since it was deleted, Spark wasn't able to read it and the entire job failed.
Such files are termed ghost files (deleted, but still listed by S3).
- One of our systems, which listens to AWS SQS for object creation in S3 and indexes that object's data in Elasticsearch, was sometimes failing to do this simple task.
Here "sometimes" means that out of 50k or 100k objects it skipped one random object.
On debugging, the issue seemed to be with the AWS SDK function that first checks the object's existence in S3.
For the skipped object, SDK logs showed that the object was not found in S3 during the existence check.
This seemed to be the same kind of issue, known as conceived files (present, but not listed).
Tackling the problem
We took advantage of S3's consistency for newly created objects and started keeping a corresponding manifest for each dataset.
Every time a job ran, we started writing the data to HDFS first, instead of pushing it directly to S3. Since HDFS is a file system, unlike S3, it is consistent.
Using this, we started generating a manifest for all the datasets by recursing over the datasets in HDFS.
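A minimal sketch of the manifest generation step, under some loud assumptions: `os.walk` over a local directory stands in for a recursive HDFS listing (on a real cluster you would list HDFS through whatever client you already use), and the `{"objects": [...]}` manifest layout and the function names are made up for illustration, not what we actually ran.

```python
import json
import os


def build_manifest(local_root, s3_prefix):
    """Walk local_root recursively and return the S3 URIs its files map to.

    local_root is a stand-in for the HDFS dataset directory (assumption).
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for name in filenames:
            # Path of the file relative to the dataset root, with forward slashes.
            rel = os.path.relpath(os.path.join(dirpath, name), local_root)
            entries.append(s3_prefix.rstrip("/") + "/" + rel.replace(os.sep, "/"))
    # Sort so the manifest is deterministic regardless of walk order.
    return sorted(entries)


def write_manifest(entries, manifest_path):
    """Write one JSON document listing every object a consumer must read."""
    with open(manifest_path, "w") as fh:
        json.dump({"objects": entries}, fh, indent=2)
```

The point is only that the listing comes from a consistent file system, so the manifest can be trusted even when an S3 listing cannot.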
Also, we started appending a random hash of size 31 as a suffix to the S3 path, making sure it is always a new object. I had also read somewhere that the S3 path guidelines recommend having a random hash in the S3 path.
Because the paths were no longer predictable, the manifest became much more of a requirement while reading the data.
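Roughly, generating such a suffix could look like the sketch below. The alphabet, the `-` separator, and the helper names are assumptions for illustration; the only detail taken from our setup is the length of 31.

```python
import secrets
import string

# Lowercase letters and digits keep the suffix safe to use in an S3 key.
ALPHABET = string.ascii_lowercase + string.digits


def random_suffix(length=31):
    """Return a random string of the given length."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def versioned_s3_path(base_path, length=31):
    """Append a fresh random suffix so every run writes to a brand new prefix."""
    return base_path.rstrip("/") + "-" + random_suffix(length) + "/"
```

Since every run lands on a prefix that never existed before, all objects under it are new objects, which is the case where S3 gave us read-after-write consistency.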
Now, the catch was how to copy ~2 TB of data from HDFS to S3 really quickly. The normal AWS SDK/CLI commands were not that efficient and would have taken a lot more time than we had.
Here the s3-dist-cp tool came in handy. It basically launches a MapReduce job to copy data, and it can copy in both directions between S3 and HDFS.
Just to give an idea of its speed, it takes merely ~9 minutes to copy ~2 TB of data from HDFS to S3.
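For reference, a hedged sketch of how the copy might be triggered from a Python wrapper. It uses only the `--src` and `--dest` options of s3-dist-cp; the wrapper functions and the example paths are hypothetical, and it assumes it runs on an EMR node where the tool is installed.

```python
import subprocess


def s3_dist_cp_command(src, dest):
    """Build the argument list for a plain s3-dist-cp copy from src to dest."""
    return ["s3-dist-cp", "--src", src, "--dest", dest]


def copy_to_s3(src, dest):
    """Run the copy; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(s3_dist_cp_command(src, dest), check=True)
```

A call would then look something like `copy_to_s3("hdfs:///datasets/output", "s3://bucket-name/path-to-read/")`, with both paths being placeholders.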
Now the consumer Spark jobs started reading from the manifest, which has the entire list of S3 objects that have to be read.
Earlier we relied on Spark's read function.
Something like this: spark.read.json("s3://bucket-name/path-to-read/*")
Which led us to the above issues.
This resolved our issue of ghost files, because now we knew exactly which files had to be read.
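A rough PySpark-flavoured sketch of the manifest-driven read. The `{"objects": [...]}` manifest layout and the helper names are assumptions for illustration; the one thing relied on from Spark is that `spark.read.json` accepts a list of paths as well as a single string.

```python
import json


def load_manifest(manifest_text):
    """Return the list of S3 URIs recorded in the manifest (assumed layout)."""
    return json.loads(manifest_text)["objects"]


def read_dataset(spark, manifest_text):
    """Read exactly the objects named in the manifest.

    `spark` is an existing SparkSession passed in by the caller.
    """
    paths = load_manifest(manifest_text)
    # An explicit list of paths replaces the wildcard, so a stale S3 listing
    # can no longer slip a deleted (ghost) file into the read.
    return spark.read.json(paths)
```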
For the conceived files, where S3 was not able to list the file, we now had the manifest as an additional resource to verify S3's verdict.
Now we were able to figure out when S3 was lying.
This resolved our conceived files issue.
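To make the idea concrete, here is a small sketch of cross-checking S3 against the manifest. Everything in it is an assumption for illustration: the function names, the `exists` callable (a stand-in for whatever SDK existence check you use), and the retry-with-delay policy, which is one possible reaction to a mismatch rather than a description of what we ran.

```python
import time


def compare_with_manifest(manifest_keys, listed_keys):
    """Classify disagreements between the manifest and what S3 listed."""
    expected = set(manifest_keys)
    listed = set(listed_keys)
    return {
        # In the manifest but not listed by S3 yet ("conceived" files).
        "missing": sorted(expected - listed),
        # Listed by S3 but not part of the dataset ("ghost" files).
        "ghost": sorted(listed - expected),
    }


def wait_until_visible(key, exists, attempts=5, delay_seconds=1.0):
    """Re-check a key the manifest vouches for until S3 reports it.

    `exists` is a caller-supplied function taking a key and returning a bool.
    """
    for attempt in range(attempts):
        if exists(key):
            return True
        # Sleep between attempts, but not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

If the manifest says an object exists and S3 says it does not, the manifest wins and the check is simply repeated instead of skipping the object.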
More to look for
There are multiple solutions available for the problems caused by S3's eventual consistency model. A few links that might be helpful: