FFL Hub, SQL Batched Inserts vs Joins, and the CS/CE Knowledge Divide

Before diving into this monster of a post (one that I am sure is going to ruffle some feathers), it’s probably helpful to give some context.

Alongside being a Computer Engineering student, I have also been a long-time gun owner, machinist, and, since last year, the owner of a firearms business called Bickham Firearms. This business has an online presence, and I decided that it would perhaps be lucrative to delve into the firearms drop-shipping space as another source of revenue for the business.

Which brings me to the first challenge that needed to be solved: How do we manage and create a website that facilitates ordering firearms-related products online (think Amazon, but for guns)? Turns out, there are tons of solutions already available! Do they fit my needs? No!

Why the pre-existing solutions don’t fit my needs:

  1. They cost money. Upwards of $80/month just for the entry-level option. I am a poor college student. I don’t even have $80/month to spend on chicken nuggets and mac and cheese, let alone a piece of software that only MIGHT make money.
  2. They have authority over your business, your offered items, etc. I don’t like this. The entire spirit of gun ownership is self-reliance and independence.

So, based on these two gripes, I decided to roll my own! For one, it’s free. This solves issue #1. However, we also have FULL control over everything. I choose how catalogs are imported and updated, I choose the shipping cost rules, etc.

I call it FFL Hub. It is a WordPress plugin that integrates with WooCommerce and handles external distributors, large product catalogs, the syncing of pricing and quantity data for those catalogs, and automated ordering/shipping processing. Take all of these things, which are already hard at a base level, add the fact that the firearms industry is regulated into the dirt, and the whole ordeal becomes a big one to tackle.


The Architecture:

At a high level, the system pulls large product catalogs and inventory data from external vendor APIs and FTP feeds, normalizes that data into a consistent internal format, and keeps an online storefront synchronized with real distributor inventory.
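
To make “normalizes that data into a consistent internal format” a little more concrete, here is a minimal sketch of the idea. The vendor-side field names and array positions below are illustrative placeholders, not the actual Lipsey’s or RSR schemas:

    <?php
    // Illustrative only: map each vendor's record shape into one internal shape.
    // The vendor-side keys/indexes here are placeholders, not the real feeds.
    function normalize_lipseys_item(array $item): array {
        return [
            'sku'      => $item['itemNo']      ?? '',
            'upc'      => $item['upc']         ?? '',
            'title'    => $item['description'] ?? '',
            'price'    => (float) ($item['price']    ?? 0),
            'quantity' => (int)   ($item['quantity'] ?? 0),
            'vendor'   => 'lipseys',
        ];
    }

    function normalize_rsr_row(array $row): array {
        // Rows from the RSR file are treated here as positional arrays (illustrative).
        return [
            'sku'      => $row[0] ?? '',
            'upc'      => $row[1] ?? '',
            'title'    => $row[2] ?? '',
            'price'    => (float) ($row[3] ?? 0),
            'quantity' => (int)   ($row[4] ?? 0),
            'vendor'   => 'rsr',
        ];
    }

Everything downstream (storefront sync, pricing updates, order building) only ever sees that one internal shape.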

On the order side, it builds structured order placement jobs and dispatches them to the appropriate distributor systems through background processing pipelines.

From an engineering perspective, this means dealing with:

  • Large periodic data refreshes
  • Inconsistent external schemas
  • Partial API failures
  • Retry and backoff logic
  • Idempotency guarantees to prevent duplicate external orders (see the sketch below)
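
To illustrate that last bullet, here is a minimal sketch of the idempotency-guard pattern. The table name and the dispatch helper are hypothetical stand-ins, not the plugin’s real internals:

    <?php
    // Sketch: record a unique key per (distributor, order) BEFORE dispatching.
    // A UNIQUE index on idempotency_key means a retried job inserts 0 rows
    // and bails out instead of placing the same external order twice.
    function fflhub_dispatch_order_once(wpdb $wpdb, int $order_id, string $distributor): bool {
        $key = sprintf('%s-%d', $distributor, $order_id);

        $inserted = $wpdb->query($wpdb->prepare(
            "INSERT IGNORE INTO {$wpdb->prefix}fflhub_dispatches (idempotency_key) VALUES (%s)",
            $key
        ));

        if (!$inserted) {
            return false; // Already dispatched, or another worker got here first.
        }

        fflhub_send_order_to_distributor($distributor, $order_id); // hypothetical helper
        return true;
    }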

In this post, we’ll primarily focus on the catalog syncing and inventory quantity synchronization portion of the system.


The Catalog Syncing System: A Tale of Two APIs

When I first started to write this code, I had two main distributors that I dealt with: Lipsey’s and RSR. Lipsey’s mainly supplies me with my firearms, whilst RSR supplies me with my ammo/accessories (RSR bans small-time businesses from purchasing firearms).

Both of these distributors offer ways in which you can grab catalog data, just in two VERY different ways.

Lipsey’s

Lipsey’s serves up two API endpoints that are of interest to us:

  1. Catalog
  2. PricingAndQuantity

These two are pretty self-explanatory. They are both retrieved through a POST request to the endpoint URLs they provide, but each one behaves a bit differently.

The Catalog endpoint updates once every 4 hours. It provides no context as to when the next update will be. This means that staying in phase with catalog updates is impossible. Fortunately, this is not an issue because of our second endpoint.

The PricingAndQuantity endpoint updates every hour. At least, that is what their API docs state. This is a lie. It actually updates every 30 minutes, and I know this because they DO provide a field in the POST response called “nextUpdate”. We use this nextUpdate value to time our API calls so that they only run when there is new data for us to churn through.

It is CRUCIAL that we time the pricing and quantity updates in phase so that we aren’t serving stale pricing and quantity data to the customer. We don’t want a customer ordering something that is out of stock. That is bad.
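
In WordPress terms, a rough sketch of how that timing works is below. The endpoint URL, auth header, option name, and cron hook are placeholders; the nextUpdate field is the real thing Lipsey’s returns:

    <?php
    // Sketch: pull PricingAndQuantity, then schedule the next run for just
    // after the "nextUpdate" time the response advertises. URL/auth are fake.
    function fflhub_lipseys_pricing_sync(): void {
        $response = wp_remote_post('https://api.lipseys.example/PricingAndQuantity', [
            'headers' => ['Token' => get_option('fflhub_lipseys_token')],
            'timeout' => 60,
        ]);

        if (is_wp_error($response)) {
            // Couldn't reach the API: back off and retry later instead of hammering it.
            wp_schedule_single_event(time() + 15 * MINUTE_IN_SECONDS, 'fflhub_lipseys_pricing_sync');
            return;
        }

        $body = json_decode(wp_remote_retrieve_body($response), true);
        // ...process the items (in the real code they get streamed to a TSV,
        // see the SQL section below, rather than held in memory like this)...

        // Run again one minute after the advertised next update.
        $next = strtotime($body['nextUpdate'] ?? '') ?: time() + 30 * MINUTE_IN_SECONDS;
        wp_schedule_single_event($next + MINUTE_IN_SECONDS, 'fflhub_lipseys_pricing_sync');
    }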

RSR

RSR serves up API endpoints, but they are non-functional. Their documentation does not state this (and it also does not state the proper calling parameters for their ordering calls, but that is another issue).

Instead, they provide two files on an FTP server. One is for the overall catalog (updated once every 2 hours), and the other one is for the quantity data (updated every 5 minutes).

There is ZERO heads-up as to when the next update will land; therefore, we use a heat-up and cooldown rule to log in to the FTP server only when we can safely assume a file update is reasonably close. This saves the time it takes to establish an FTP connection. Once we have a connection, we check the last-modified date along with the file size to make sure the data is fresh and not currently being written.
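
A rough sketch of that freshness check using PHP’s stock FTP functions is below; the host, credential constants, and the sleep threshold are placeholders:

    <?php
    // Sketch: only treat the RSR file as fresh if it's newer than our last
    // import AND its size has stopped changing (i.e., it isn't mid-upload).
    // Host and credential constants are placeholders.
    function fflhub_rsr_file_is_fresh(string $remote_file, int $last_import_ts): bool {
        $conn = ftp_connect('ftp.rsr.example');
        if (!$conn || !ftp_login($conn, FFLHUB_RSR_USER, FFLHUB_RSR_PASS)) {
            return false;
        }
        ftp_pasv($conn, true);

        $mtime = ftp_mdtm($conn, $remote_file); // last-modified timestamp, -1 on failure
        $size1 = ftp_size($conn, $remote_file);
        sleep(5);                               // crude "is it still being written?" probe
        $size2 = ftp_size($conn, $remote_file);
        ftp_close($conn);

        return $mtime > $last_import_ts && $size1 > 0 && $size1 === $size2;
    }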


How this ties into SQL Optimizations

So, with all of these methods of grabbing product data figured out, how do we efficiently update our SQL tables to reflect the latest pricing and quantity info?

I can tell you how to NOT do it: Batched Inserts.

And I learned this one the hard way. For the Lipsey’s catalog/quantity update cron jobs, I was seeing runtimes of up to 15 seconds and memory usage of up to 200MB. The results were similar for RSR, minus the memory usage.

Here is a little rule that I have learned through this entire ordeal that will stick with me until the day I die:

Don’t trust code you didn’t write without reading it first.

As it turns out, the default PHP implementation that Lipsey’s provides streams the entire endpoint result straight into memory and lets it sit there as a giant keyed array until you do something with it (like taking each item and batch inserting the new data into your SQL table).

This problem was solved by taking their code, throwing it out the window, and instead writing a little snippet that streams the endpoint result into a TSV file (for both Catalog updates and for Quantity updates). This solved the memory usage problem, but what about the run time?
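
The replacement snippet is conceptually tiny: hand cURL an open file handle so the response body goes straight to disk instead of into a PHP string. The URL and the TSV conversion step are simplified placeholders; the real code walks the payload incrementally and writes one tab-separated row per item:

    <?php
    // Sketch: stream the endpoint response to a file instead of buffering it.
    function fflhub_fetch_to_file(string $url, string $dest): bool {
        $fh = fopen($dest, 'wb');
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_POST    => true,
            CURLOPT_FILE    => $fh,   // write the body directly to the file handle
            CURLOPT_TIMEOUT => 120,
        ]);
        $ok = curl_exec($ch) !== false;
        curl_close($ch);
        fclose($fh);
        return $ok;
    }

    // Emitting a normalized TSV row is then just:
    // fwrite($tsv, implode("\t", [$sku, $price, $qty]) . "\n");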

Enter: Join statements

Here’s another rule that I picked up:

For things such as SQL, it’s better to rely on the low-level optimized code for massive table edits rather than rolling your own (most of the time).

The thing is, inserting the new data into the table as batches was SLOW. REALLY SLOW.

What ended up being the silver bullet was this: instead of batching inserts, we take that TSV file (whether it is a Catalog file or a quantity update file), bulk-load it into a staging table using the database’s file-loading facility, and then merge that staging table into our authoritative catalog table with a single join-based statement. (In reality, there is a bit more to it, but this was the BIG gotcha that ended up halving our update times.)
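
For the curious, the shape of that approach looks roughly like this; the table and column names are illustrative, and it assumes MySQL/MariaDB with LOAD DATA LOCAL INFILE enabled on both the client and the server:

    <?php
    // Sketch of the staging-table approach for a quantity/pricing update:
    // 1) bulk-load the TSV into an empty staging table with LOAD DATA, then
    // 2) merge it into the authoritative catalog with one join-based statement.
    function fflhub_merge_quantity_tsv(mysqli $db, string $tsv_path): void {
        $db->query("TRUNCATE TABLE fflhub_staging_qty");

        $db->query(sprintf(
            "LOAD DATA LOCAL INFILE '%s'
             INTO TABLE fflhub_staging_qty
             FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'
             (sku, price, quantity)",
            $db->real_escape_string($tsv_path)
        ));

        // One set-based statement instead of thousands of per-row writes.
        $db->query(
            "UPDATE fflhub_catalog c
             JOIN fflhub_staging_qty s ON s.sku = c.sku
             SET c.price = s.price,
                 c.quantity = s.quantity"
        );
    }

Full catalog refreshes follow the same pattern, just with something like an INSERT ... SELECT ... ON DUPLICATE KEY UPDATE so new products get created and existing ones get updated in one pass.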


The Results:

By doing this, here is what we achieved speed-wise:

Lipsey’s:

  • Old Catalog Update: 15s (7s from endpoint + 8s SQL updates)
  • New Catalog Update: 7.1s (7s from endpoint + 100ms SQL updates)

RSR:

  • Old Catalog Update: 5s (2s FTP download + 3s SQL updates)
  • New Catalog Update: 2.1s (2s FTP download + 100ms SQL updates)

Pretty astonishing speedups, and now most of the time is spent on things that are out of our hands (which we mitigate, as described earlier, through the heat-up and cooldown rules).


The ECE vs CS Knowledge Divide

If you have read this far, you have probably been scratching your head as to why I never knew that JOIN statements were preferred in situations like this in the first place.

This is because, as ECE (Computer Engineering) majors, we learn NOTHING about SQL, databases, or anything web-stack related. I come from the land of Assembly, C, and systems programming.

In contrast, CS (Computer Science) majors deal with this sort of stuff all the time. It is just about every other day that I hear a CS major walking down the hall talking about their new React project or their new C# web app that deals with databases. On the flip side, when they do end up having to take Operating Systems (a class that is exclusively taught through C and PintOS), they get absolutely walloped. So I guess in the end the learning struggles just about even out.
