Channel: Trying to extract data from several thousand JSON files in Java with schema deviations - Stack Overflow

Trying to extract data from several thousand JSON files in Java with schema deviations


I've got several thousand JSON files. Most of them hold one large JSON array, "items", with as many as 10,000 elements, and to make things even more interesting, the data structure of those elements can vary from one element to the next: sometimes a single property deviates from the norm, sometimes an element contains additional arrays of its own. It's this "items" array that I need to extract from each of these files.

My plan of attack, at least by my logic, is to first extract each of the different data structures from all of the files, so that I know what I'm going after when I try to get the data. If I can't name the elements I want, how could I get them? There might actually be a way of doing that; I'm just not knowledgeable enough about JSON, Gson, etc. to know one way or the other.
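To make that survey step concrete, here is a minimal sketch of what I have in mind. It assumes each file has already been parsed into plain Java maps, lists, and scalars (as far as I know, that is what Gson returns when asked to deserialize into Object.class). The names StructureSurvey, signature, and survey are made up for illustration and do not come from any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class StructureSurvey {

    // Describes a parsed JSON value by key names and value types only.
    // Keys are sorted so that property order does not change the result.
    static String signature(Object value) {
        if (value == null) return "null";
        if (value instanceof Map) {
            TreeMap<String, String> fields = new TreeMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                fields.put(String.valueOf(e.getKey()), signature(e.getValue()));
            }
            List<String> parts = new ArrayList<>();
            for (Map.Entry<String, String> e : fields.entrySet()) {
                parts.add(e.getKey() + ":" + e.getValue());
            }
            return "{" + String.join(",", parts) + "}";
        }
        if (value instanceof List) {
            // An array is described by the set of distinct shapes it contains.
            TreeSet<String> shapes = new TreeSet<>();
            for (Object element : (List<?>) value) shapes.add(signature(element));
            return "[" + String.join("|", shapes) + "]";
        }
        if (value instanceof Number) return "number";
        if (value instanceof Boolean) return "boolean";
        return "string";
    }

    // Counts how many elements share each distinct signature.
    static Map<String, Integer> survey(List<?> items) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Object item : items) counts.merge(signature(item), 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        // Hand-built elements shaped like a cut-down version of the samples.
        Map<String, Object> plain = new LinkedHashMap<>();
        plain.put("item_name", "Sneakers");
        plain.put("timestamp", 1616987224024L);
        plain.put("for_sale", false);

        Map<String, Object> color = new LinkedHashMap<>();
        color.put("color1", "Red");
        Map<String, Object> shirts = new LinkedHashMap<>(plain);
        shirts.put("color_options", Arrays.asList(color));

        List<Map<String, Object>> items = Arrays.asList(plain, shirts, plain);
        survey(items).forEach((shape, count) -> System.out.println(count + " x " + shape));
    }
}
```

Running survey over the "items" list of every file and merging the counts would give one line per distinct element structure, which is the inventory I'm after before writing any real extraction code.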

This will be my first real JSON project as well ... I've not ever played with JSON before so I've spent a lot of time Googling and reading and I definitely understand - NOW - how JSON works ... I'm just ill-equipped to wield it with any kind of effectiveness. I've spent the last couple of days on these files, and although I've gained some ground, I'm smart enough to know when I've gotten to the point where I need some help from people who have done this before.

These examples are not cut and paste from these files. I made them generic for simplicity. But here is what I've seen so far as an example of the differences in structures from one file to the next. The first file is by far the most common ... where the "items" array will have that static structure with the exact same element names but there will be 10,000 of them within a file ... while the next file won't be so clean.

Most common JSON file that I am seeing among these files:

{
  "employees": [
    {"name": "John Doe"},
    {"name": "Jane Doe"}
  ],
  "items": [
    {
      "item_name": "Goofy Widget",
      "timestamp": 1616987224024,
      "contents": "Some really nice goofy widgets",
      "item_type": "Cleaning Widget",
      "for_sale": false
    },
    {
      "item_name": "Machine Widget",
      "timestamp": 1616987218652,
      "contents": "Hand held vacuum",
      "item_type": "Functional Widget",
      "for_sale": false
    }
  ],
  "items_from_inventory": true,
  "category_type": "Average",
  "region_placement": "Northwest America"
}

And having manually looked over several files, some can look like this, where there is deviation sometimes from one complete array element to the next:

{
  "employees": [
    {"name": "Jack Smith"},
    {"name": "Joe Smith"},
    {"name": "Jimmy Smalley"}
  ],
  "items": [
    {
      "item_name": "Sneakers",
      "timestamp": 1616987224024,
      "contents": "Plain white sneakers",
      "item_type": "Foot Wear",
      "for_sale": false
    },
    {
      "item_name": "Personal T-Shirts",
      "timestamp": 1616987224024,
      "contents": "Color variety T-Shirts",
      "color_options": [
        {"color1": "Red", "color2": "Green", "color3": "Black", "color4": "White"}
      ],
      "item_classifications": [
        {"class1": "Weekend Use", "class2": "Family Picnics", "class3": "Casual Fridays"}
      ],
      "for_sale": false
    },
    {
      "item_name": "Basketballs",
      "timestamp": 1616987218652,
      "contents": "Three quality basketballs",
      "item_type": "Sport Items",
      "brands": [
        {"brand1": "Spalding", "brand2": "Wilson"}
      ],
      "for_sale": false
    }
  ],
  "items_from_inventory": false,
  "category_type": "Personal Use",
  "region_placement": "North America"
}

The basic core structure of these files is fairly consistent from one file to the next. The deviation seems to be mainly within the "items" array, where some elements have a different data structure (schema, as we know it in the MySQL world) than others.
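One way I can picture modelling that (a fixed core plus whatever extras an element happens to carry) is sketched below. It is only an illustration under the same assumption that each element arrives as a plain Java map; FlexibleItem and its extras field are hypothetical names, and item_type is treated as optional because the "Personal T-Shirts" element in my example does not have one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;

class FlexibleItem {
    // Properties present in every sample element.
    final String itemName;
    final long timestamp;
    final String contents;
    final boolean forSale;
    // Missing from some elements, so it may be null.
    final String itemType;
    // Anything not listed above (color_options, brands, ...) stays here untyped.
    final Map<String, Object> extras = new LinkedHashMap<>();

    FlexibleItem(Map<String, ?> raw) {
        // Work on a copy so the caller's map is left untouched.
        Map<String, Object> rest = new LinkedHashMap<>(raw);
        itemName = (String) rest.remove("item_name");
        Object ts = rest.remove("timestamp");
        // A generic parser may hand numbers back as Double, so go through Number.
        timestamp = ts instanceof Number ? ((Number) ts).longValue() : 0L;
        contents = (String) rest.remove("contents");
        itemType = (String) rest.remove("item_type");
        forSale = Boolean.TRUE.equals(rest.remove("for_sale"));
        extras.putAll(rest);
    }

    public static void main(String[] args) {
        Map<String, Object> raw = new LinkedHashMap<>();
        raw.put("item_name", "Personal T-Shirts");
        raw.put("timestamp", 1616987224024d);
        raw.put("contents", "Color variety T-Shirts");
        raw.put("color_options", new ArrayList<Object>());
        raw.put("for_sale", false);

        FlexibleItem item = new FlexibleItem(raw);
        System.out.println(item.itemName + " has extras " + item.extras.keySet());
    }
}
```

The appeal of this shape is that an element with an unexpected property does not break extraction; the unknown property simply lands in extras, where it can be inspected later.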

I've mainly been experimenting with Gson because it seems to be fairly popular, but I'm not attached to any particular library or libraries; I just need to get to the data.

I decided I'd start with targeting the most common array structure that I'm seeing so far, and this is what I came up with. Here is the class that represents the most common array structure:

package widgets;

public class Widget {

    public Widget(String itemName, long timestamp, String contents, String itemType, boolean forSale) {
        this.itemName  = itemName;
        this.timestamp = timestamp;
        this.contents  = contents;
        this.itemType  = itemType;
        this.forSale   = forSale;
    }

    // One field per property of the most common "items" element
    private String  itemName;
    private long    timestamp;
    private String  contents;
    private String  itemType;
    private boolean forSale;

    public void setItemName(String itemName) { this.itemName = itemName; }
    public void setTimestamp(long timestamp) { this.timestamp = timestamp; }
    public void setContents(String contents) { this.contents = contents; }
    public void setItemType(String itemType) { this.itemType = itemType; }
    public void setForSale(boolean forSale)  { this.forSale = forSale; }

    public String getItemName() { return itemName; }
    public long getTimestamp()  { return timestamp; }
    public String getContents() { return contents; }
    public String getItemType() { return itemType; }
    public boolean isForSale()  { return forSale; }

    // Labels match the field names used in this generic example
    @Override
    public String toString() {
        return "itemName = " + this.itemName + "\n"
             + "timestamp = " + this.timestamp + "\n"
             + "contents = " + this.contents + "\n"
             + "itemType = " + this.itemType + "\n"
             + "forSale = " + (this.forSale ? "true" : "false") + "\n";
    }
}

I kind of want to leave it right here and not really get into where I've succeeded and where I've failed because I don't really care about what I'm doing wrong, I just need to know how to do it right... so here is what I'm asking for:

Will someone show me how to extract all of the JSON structure definitions from these files, including the different structures that can occur randomly among the elements of each "items" array?

And can someone show me how to properly extract the data given the fact that the structure of the "items" array can be different from one element to the next?

I just need someone who has been here before and can point me down the right path so that I don't have to walk each path, turn around and walk back then try another one.

I would be most grateful for the help.

Thank you,

Mike Sims

