Introduction

When I was a kid, I went through the same LEGO phase as most other kids. LEGOs were always on my birthday and Christmas wish lists. I loved coming home from school and finding the monthly LEGO catalog in the mail, where I would spend the next hour scanning each page for new sets that piqued my interest. Only the best sets would make it onto my wish lists, and it was up to my parents to figure out the logistics of getting these seemingly high-demand sets. While I did not receive every LEGO set I asked for, I still have fond memories of building and playing with the ones I did receive.

Like most kids, I slowly grew out of LEGO. Nothing changed with LEGO; I was just getting older and no longer a part of the target demographic (it’s not you, it’s me). However, after many years and a degree in mechanical engineering, I have come back to LEGO. I have a new appreciation for the creative and advanced building techniques while also having a soft spot for the nostalgia of opening a box and building my toy. But now that I’m older, I understand the trouble my parents went through in finding some of these LEGO sets.

The final season of the animated TV show Star Wars: The Clone Wars aired during the beginning of the COVID-19 pandemic, and the ending was nothing short of spectacular. Some of the characters from the series were at their peak in popularity, and LEGO capitalized on this opportunity by releasing a 501st battle pack and an AAT featuring Ahsoka Tano and a 332nd clone trooper just a few months after the final episode aired. These sets were so popular among Star Wars fans that they were nearly impossible to get for several months. In fact, LEGO had to impose a limit of one-per-person due to the insane demand.

This frenzy is similar to the PlayStation 5 release in that it was/is nearly impossible to get your hands on one. Some very determined people used bots to purchase a PlayStation 5 whenever it was in stock so that they didn’t have to check multiple times a day for an inventory update. This inspired me to create something similar that would check online and in-store inventory for LEGO sets I wanted, and I believe I was able to do just that (I didn’t want to automatically purchase LEGO sets as I feel that would be unethical).

This blog serves as a walkthrough on how to get inventory status data for LEGO sets without visiting LEGO.com.

Preliminary Code

Import Libraries

Every project I work on is assisted by the tools and resources from various libraries, and here are the libraries I use for this one:

library(dplyr)
library(rvest)
library(jsonlite)
library(RSelenium)
library(knitr)

Here is what each library contributes to this project.

  • dplyr: a grammar for data manipulation
  • rvest: scraping HTML data from websites
  • jsonlite: parsing data in JSON format
  • RSelenium: scraping website data using headless web browsing
  • knitr: formatting R code output

Clean Workspace

Another thing I like to do at the beginning is clear the workspace and set the working directory. This eliminates any stale variables and confirms the file location for external dependencies.

rm(list = ls())
dirpath <- dirname(rstudioapi::getSourceEditorContext()$path)
setwd(dirpath)

Functions

The vast majority of code in the script is for functions. In this section, I will outline each function and describe its intended purpose.

XPathPartialClass

All data from a webpage is stored in some container. Most of these containers have attributes that specify the design and format of the web element. Some of the names for these attributes can be quite lengthy and specific, which can be problematic when trying to maintain a robust script. The XPathPartialClass function gives me the option to only provide as much of the attribute name as I want without using the whole name.

For example, one of the class names on the LEGO website is StoreCheckerstyles__StoreSelected-kr3ej5-3 ejSejf. Instead of searching all nodes for that string, I can simply enter StoreSelected as an argument in the XPathPartialClass function. This will get me what I want (assuming StoreSelected is unique) without relying on the seemingly random alphanumeric characters in the class name.

XPathPartialClass <- function(name, container = "div", attribute = "class") {
  return(
    paste0(
      '//',
      container,
      '[@',
      attribute,
      ' and contains(concat(" ", normalize-space(@',
      attribute,
      '), " "),"',
      name,
      '")]'
    )
  )
}
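To see what the helper produces, the short sketch below repeats the function in compact form (so the snippet runs on its own) and prints the XPath for the StoreSelected example. The second call uses the button and data-test arguments that appear later in MasterScraper.

```r
# Compact copy of XPathPartialClass so this snippet runs on its own
XPathPartialClass <- function(name, container = "div", attribute = "class") {
  paste0(
    '//', container, '[@', attribute,
    ' and contains(concat(" ", normalize-space(@', attribute, '), " "),"',
    name, '")]'
  )
}

# Default arguments: any div whose class attribute contains "StoreSelected"
store_xpath <- XPathPartialClass("StoreSelected")
cat(store_xpath, "\n")

# A different element type and attribute
button_xpath <- XPathPartialClass("stock-accordion", "button", "data-test")
cat(button_xpath, "\n")
```

Because contains() matches substrings, a short fragment could also match unrelated attribute values, so it is worth choosing a fragment that is distinctive on the page.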

ResetPorts

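This script will throw an error if the remote driver attempts to occupy a port that is already in use. ResetPorts avoids this by terminating any Java processes left over from a previous run, which frees the ports they were holding. Note that taskkill is a Windows command and that this call ends every running java.exe process, not just the ones started by RSelenium.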

ResetPorts <- function() {
  invisible(capture.output(
    system(
      "taskkill /im java.exe /f",
      intern = FALSE,
      ignore.stdout = FALSE,
      show.output.on.console = FALSE
    )
  ))
}

ProductOnline

All item information, including online availability, is embedded in the product page’s HTML, so it can be read without opening LEGO.com in a browser. This means that a remote driver is not needed to run this function; if you only care about a LEGO set’s inventory status on the online store, ProductOnline is all you need to run. The only argument is the set URL.

ProductOnline <- function(link) {
  ...
}

All of the data is stored as JSON inside a script container in the HTML. This part of the function finds that container, strips the JavaScript assignment around the JSON, and parses the result into a nested list.

scripts <- link %>%
  read_html() %>%
  html_nodes("script")

for (j in seq_along(scripts)) {
  extracted_json <- scripts %>%
    .[j] %>%
    html_text(trim = TRUE)
  
  if (grepl("^window.__", extracted_json)) {
    break
  }
  
}

object_json <- extracted_json %>%
  gsub("^window.*window.*__=|[;]$", "", .) %>%
  fromJSON(simplifyDataFrame = TRUE) %>%
  {.$apolloState$data}
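The cleanup step is easier to follow on a small input. The sketch below runs the same gsub pattern and fromJSON call on a made-up script string; the variable names __FLAG__ and __STATE__ and the keys inside the JSON are assumptions for illustration, not the actual LEGO.com payload.

```r
library(jsonlite)

# Made-up script text for illustration; the real payload is much larger
# and its key names may differ
script_text <- paste0(
  'window.__FLAG__=true;window.__STATE__=',
  '{"apolloState":{"data":{',
  '"ProductVariant:1":{"vipPoints":195},',
  '"ProductVariant:1.attributes":{"canAddToBag":true,',
  '"availabilityText":"Available now"}',
  '}}};'
)

# Same cleanup as above: drop the leading assignments and the trailing semicolon
json_text <- gsub("^window.*window.*__=|[;]$", "", script_text)

# Parse the remaining JSON and keep only the data element
object_json <- fromJSON(json_text, simplifyDataFrame = TRUE)$apolloState$data

names(object_json)
object_json[["ProductVariant:1.attributes"]][["availabilityText"]]
```

Note that the pattern expects two window references before the JSON begins. If the page layout changes so that only one is present, the prefix would not be stripped and fromJSON would fail, which is a useful signal that the scraper needs updating.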

The next several lines retrieve only the pieces of data I deem necessary. There is a lot more data in the object_json variable, but this is all I would theoretically need as a LEGO customer.

avl_online <- object_json %>% 
  {.[grepl("ProductVariant.*attributes", names(.))][[1]]} %>%
  .[["canAddToBag"]]
avl_details <- object_json %>% 
  {.[grepl("ProductVariant.*attributes", names(.))][[1]]} %>%
  .[["availabilityText"]]
order_limit <- object_json %>% 
  {.[grepl("ProductVariant.*attributes", names(.))][[1]]} %>%
  .[["maxOrderQuantity"]]
price <- object_json %>% 
  {.[grepl("ProductVariant.*listPrice", names(.))][[1]]} %>%
  .[["formattedValue"]]
vip_pts <- object_json %>% 
  {.[grepl("^ProductVariant", names(.))][[1]]} %>%
  .[["vipPoints"]]
product_name <- object_json %>% 
  {.[grepl("SingleVariantProduct", names(.))][[1]]} %>%
  .[["name"]] %>%
  gsub("\231","",.) # removes TM symbol
product_code <- object_json %>%
  {.[grepl("SingleVariantProduct", names(.))][[1]]} %>%
  .[["productCode"]]
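Each lookup above follows the same pattern: find the first entry whose name matches a regular expression, then pull one field from it. As a possible refactoring (not part of the original script), the hypothetical GetField helper below wraps that pattern and returns NA instead of raising an error when nothing matches. The object_json list here is a hand-built stand-in with made-up key names.

```r
# Hypothetical helper: return one field from the first entry whose name
# matches key_pattern, or NA when the entry or the field is missing
GetField <- function(data, key_pattern, field) {
  matches <- data[grepl(key_pattern, names(data))]
  if (length(matches) == 0) {
    return(NA)
  }
  value <- matches[[1]][[field]]
  if (is.null(value)) NA else value
}

# Hand-built stand-in for object_json; the key names are made up
object_json <- list(
  "ProductVariant:1" = list(vipPoints = 195),
  "ProductVariant:1.attributes" = list(
    canAddToBag = TRUE,
    availabilityText = "Available now"
  )
)

avl_online <- GetField(object_json, "ProductVariant.*attributes", "canAddToBag")
vip_pts <- GetField(object_json, "^ProductVariant", "vipPoints")
missing_value <- GetField(object_json, "SingleVariantProduct", "name")
```

In the original lookups, a pattern that matches nothing makes the [[1]] subscript fail, so a helper like this trades a hard stop for an NA that can be checked later.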

The final part of the function is to return all the data as a data frame. I also include the item location type, the location itself, and the time I retrieved the data.

return(
  data.frame(
    "ITEM" = product_code,
    "NAME" = product_name,
    "PRICE" = price,
    "VIP" = vip_pts,
    "TYPE" = "Online",
    "SOURCE" = "LEGO.com",
    "INSTOCK" = avl_online,
    "DETAILS" = avl_details,
    "ADDRESS" = link,
    "TIMESTAMP" = if_else(avl_online, Sys.time(), as.POSIXct(NA))
  )
)

StoreStatus

This function grabs the inventory status and other information for each store and returns the data in the form of a data frame. It is very similar to the ProductOnline function, but we have to use tools from RSelenium to get in-store stock details.

StoreStatus <- function() {
  store_name <- remDr$findElement(using = "xpath", '//div[@data-test = "store-inventory-preview-name"]')$
    getElementText() %>% unlist()
  store_status <- remDr$findElement(using = "xpath", '//div[@data-test = "store-inventory-preview-status"]')$
    getElementText() %>% unlist()
  store_address <- remDr$findElement(using = "xpath", '//div[@data-test = "store-inventory-preview-address"]')$
    getElementText() %>% unlist()
  
  store_bool <- grepl("in stock", store_status, ignore.case = TRUE)
  
  return(
    data.frame(
      "TYPE" = "Retail",
      "SOURCE" = store_name,
      "INSTOCK" = store_bool,
      "DETAILS" = store_status,
      "ADDRESS" = store_address,
      "TIMESTAMP" = if_else(store_bool, Sys.time(), as.POSIXct(NA))
    )
  )
}
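The INSTOCK flag comes from a case-insensitive substring match on the status text. The sketch below applies the same grepl call to the two status strings that appear in the results later in this post, plus one hypothetical wording ("Not in stock") that I have not seen on LEGO.com but that the pattern would classify as in stock.

```r
# Two status strings from the results in this post, plus one hypothetical
# wording that the pattern would misclassify
statuses <- c("In Stock at this time", "Out of Stock", "Not in stock")

# Same test as in StoreStatus
in_stock <- grepl("in stock", statuses, ignore.case = TRUE)

data.frame(DETAILS = statuses, INSTOCK = in_stock)
```

If the site ever introduces new status wording, the DETAILS column keeps the raw text so that a surprising INSTOCK value can be checked by eye.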

StoreData

This function cycles through a given number of LEGO stores and retrieves each store’s inventory status and information using the StoreStatus function. Note that the while loop will not run if only one store’s data is requested.

StoreData <- function(store_num) {
  store_i <- 1
  df <- StoreStatus()
  
  while (store_i < store_num) {
    Sys.sleep(0.5)
    remDr$findElement(using = "xpath", '//div[@data-test = "store-inventory-selected-store"]')$clickElement()
    Sys.sleep(0.5)
    remDr$findElement(
      using = "xpath",
      paste0('//div[@data-test = "store-inventory-store-unselected"][',store_i,']')
    )$clickElement()
    store_i <- store_i + 1
    
    df <- rbind(df, StoreStatus())
  }
  
  return(df)
}
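StoreData relies on fixed half-second pauses to give the page time to update. An alternative worth considering is to poll until the element is actually available. The sketch below is a hypothetical WaitFor helper, written in base R and demonstrated with a stand-in function rather than a real browser; it is not part of the original script.

```r
# Hypothetical helper: call action() repeatedly until it returns a non-NULL
# value without an error, or stop once timeout seconds have passed
WaitFor <- function(action, timeout = 5, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- tryCatch(action(), error = function(e) NULL)
    if (!is.null(result)) {
      return(result)
    }
    if (Sys.time() >= deadline) {
      stop("Timed out waiting for the action to succeed")
    }
    Sys.sleep(interval)
  }
}

# Stand-in for a page element that only appears on the third attempt
attempts <- 0
FindStoreName <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) {
    stop("element not found")
  }
  "LEGO Store"
}

store_name <- WaitFor(FindStoreName, timeout = 5, interval = 0.1)
```

With RSelenium, the action would presumably be a small function that wraps a findElement call, on the assumption that the call raises an error while the element is still missing.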

TotalData

When visiting a LEGO set’s webpage, TotalData uses both ProductOnline and StoreData to get the set’s inventory status from the online store and the nearest LEGO stores.

TotalData <- function(link, store_num) {
  df <- bind_rows(ProductOnline(link), StoreData(store_num))
  df[c("ITEM", "NAME", "PRICE", "VIP")] <- df[1, c("ITEM", "NAME", "PRICE", "VIP")]
  
  return(df)
}

MasterScraper

This is the function that combines everything I need from LEGO.com into one function. It creates a remote driver to mimic a physical interaction with LEGO.com without the user doing any work. The only arguments for MasterScraper are a character vector of LEGO set URLs, my zip code, and the number of nearest LEGO stores I want to search. Note that I have set the default number of nearest stores to three.

MasterScraper <- function(links, zip_code, store_num=3) {
  ...
}

The first thing the MasterScraper function does is clear any existing instances of the remote driver from the previous run by calling ResetPorts.

ResetPorts()

The next step is to initialize and check some variables. Along with creating an indexing variable, the function caps the number of stores at five if the user entered a larger number. This is because LEGO.com only shows the five closest stores for a given zip code. Also, LEGO stores tend to be so spread out that the fifth-closest store could easily be over 100 miles away. MasterScraper may take a few minutes to run depending on the number of LEGO sets the user is searching for, so a progress bar is nice to have.

link_i <- 1
if (store_num > 5) {store_num <- 5}
pb <- txtProgressBar(min=0, max=length(links), style=3)

The rsDriver function from RSelenium creates the remote driver for our scraping needs. The final argument makes the remote driver headless, which means I won’t see it running. When debugging, I recommend commenting the final argument out so that you can see how your code is interacting with the browser.

rD <- rsDriver(
  browser = "chrome",
  chromever = "87.0.4280.88",
  verbose = FALSE,
  extraCapabilities = list("chromeOptions" = list(args = list('--headless')))
)
remDr <<- rD$client

Now that the remote driver has been created, I can start retrieving data on my LEGO sets.

remDr$navigate(links[link_i])
Sys.sleep(2)

Upon visiting LEGO.com, a pop-up appears asking if the user wants to continue to the shopping website or the play zone.

These next two lines will click the Continue button as well as the Accept Cookies button behind the pop-up to get to the LEGO set webpage.

remDr$findElement(using = "xpath", '//button[@data-test = "age-gate-grown-up-cta"]')$clickElement()
remDr$findElement(using = "xpath", '//button[@data-test = "cookie-banner-normal-button"]')$clickElement()

This part of the function will check the LEGO set’s inventory status at the nearest LEGO store locations based on the given zip code. I am very fortunate to work relatively close to three LEGO stores, so MasterScraper will retrieve inventory data from those three locations as well as its online availability using the TotalData function.

remDr$findElement(using = "xpath", XPathPartialClass("stock-accordion", "button", "data-test"))$clickElement()

remDr$findElement(
  using = "xpath",
  '//input[@data-test = "input-with-button-input"]'
)$sendKeysToElement(list(zip_code))
remDr$findElement(using = "xpath", '//button[@data-test = "input-with-button-button"]')$clickElement()
Sys.sleep(0.5)

df <- TotalData(links[link_i], store_num)
setTxtProgressBar(pb, link_i)

If there are multiple LEGO sets, the data retrieval process is repeated for the remaining sets.

while (link_i < length(links)) {
  link_i <- link_i + 1
  
  remDr$navigate(links[link_i])
  Sys.sleep(0.5)
  remDr$findElement(using = "xpath",
                    XPathPartialClass("stock-accordion", "button", "data-test"))$clickElement()
  Sys.sleep(1.0)
  
  df <- rbind(df, TotalData(links[link_i], store_num))
  setTxtProgressBar(pb, link_i)
}

Now that all of the data has been gathered, it is time to close the remote driver. This is very important so that stale instances are not left open over time.

remDr$quit()
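One caveat: if any step before this line raises an error, remDr$quit() is never reached. A common R idiom for this situation is on.exit, which runs cleanup code whenever the enclosing function exits. The sketch below shows the idea with a hypothetical WithDriver wrapper and a fake driver object, so it runs without a browser; it is an illustration of the idiom rather than the original script's approach.

```r
# Hypothetical wrapper: run work(driver) and always call driver$quit()
# afterwards, even when work() raises an error
WithDriver <- function(driver, work) {
  on.exit(driver$quit(), add = TRUE)
  work(driver)
}

# Fake driver used only to show the behaviour without opening a browser
state <- new.env()
state$closed <- FALSE
fake_driver <- list(quit = function() state$closed <- TRUE)

# The work function fails, but the driver is still closed
outcome <- tryCatch(
  WithDriver(fake_driver, function(d) stop("element not found")),
  error = function(e) conditionMessage(e)
)

state$closed
```

Inside MasterScraper, the equivalent would be a single on.exit(remDr$quit(), add = TRUE) line placed right after the driver is created.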

Finally, the data frame is modified so that all factor columns are converted to character columns.

df <- df %>%
  mutate_if(is.factor, as.character)

return(df)

Main Code

Thanks to all of the functions, the rest of the code takes up only a few lines.

Data Acquisition

For this code demonstration, I will retrieve inventory data for three LEGO sets. These sets include the very popular 501st battle pack from August 2020, the new X-Wing from January 2021, and the new TIE Fighter also from January 2021. As previously mentioned, the 501st battle pack was a very tough set to get for several months following its release; while the X-Wing and TIE Fighter don’t seem to have the same level of demand and fandom as the 501st battle pack, they can still be hard to find in stores.

Once I add the LEGO set URLs along with my zip code and the number of stores to search for into MasterScraper, the function will run and return the product data.

links <- c(
  "https://www.lego.com/en-us/product/501st-legion-clone-troopers-75280",
  "https://www.lego.com/en-us/product/luke-skywalker-s-x-wing-fighter-75301",
  "https://www.lego.com/en-us/product/imperial-tie-fighter-75300"
)

products_df <- MasterScraper(links, zip_code = "92612", store_num = 3)

That’s it! I now have inventory status data for the 501st battle pack, the X-Wing, and the TIE Fighter.

Data Filtering

Here is a look at the data I have gathered from LEGO.com.

products_df
| ITEM | NAME | PRICE | VIP | TYPE | SOURCE | INSTOCK | DETAILS | ADDRESS | TIMESTAMP |
|---|---|---|---|---|---|---|---|---|---|
| 75280 | 501st Legion Clone Troopers | 29.99 | 195 | Online | LEGO.com | TRUE | Available now | https://www.lego.com/en-us/product/501st-legion-clone-troopers-75280 | 2021-01-31 19:04:06 |
| 75280 | 501st Legion Clone Troopers | 29.99 | 195 | Retail | LEGO® Store South Coast Plaza | FALSE | Out of Stock | South Coast Plaza, 3333 Bristol Street #1042, Costa Mesa, CA 92626 | NA |
| 75280 | 501st Legion Clone Troopers | 29.99 | 195 | Retail | LEGO® Store The Shops At Mission Viejo | TRUE | In Stock at this time | 555 The Shops at Mission Viejo, The Shops at Mission Viejo, Space 428b, Mission Viejo, CA 92691 | 2021-01-31 19:04:08 |
| 75280 | 501st Legion Clone Troopers | 29.99 | 195 | Retail | LEGO® Store Downtown Disney® District | TRUE | In Stock at this time | 1585 Disneyland Drive, Anaheim, CA 92802 | 2021-01-31 19:04:09 |
| 75301 | Luke Skywalker’s X-Wing Fighter | 49.99 | 325 | Online | LEGO.com | TRUE | Backorders accepted, will ship in 60 days | https://www.lego.com/en-us/product/luke-skywalker-s-x-wing-fighter-75301 | 2021-01-31 19:04:15 |
| 75301 | Luke Skywalker’s X-Wing Fighter | 49.99 | 325 | Retail | LEGO® Store South Coast Plaza | FALSE | Out of Stock | South Coast Plaza, 3333 Bristol Street #1042, Costa Mesa, CA 92626 | NA |
| 75301 | Luke Skywalker’s X-Wing Fighter | 49.99 | 325 | Retail | LEGO® Store The Shops At Mission Viejo | FALSE | Out of Stock | 555 The Shops at Mission Viejo, The Shops at Mission Viejo, Space 428b, Mission Viejo, CA 92691 | NA |
| 75301 | Luke Skywalker’s X-Wing Fighter | 49.99 | 325 | Retail | LEGO® Store Downtown Disney® District | FALSE | Out of Stock | 1585 Disneyland Drive, Anaheim, CA 92802 | NA |
| 75300 | Imperial TIE Fighter | 39.99 | 260 | Online | LEGO.com | TRUE | Available now | https://www.lego.com/en-us/product/imperial-tie-fighter-75300 | 2021-01-31 19:04:23 |
| 75300 | Imperial TIE Fighter | 39.99 | 260 | Retail | LEGO® Store South Coast Plaza | FALSE | Out of Stock | South Coast Plaza, 3333 Bristol Street #1042, Costa Mesa, CA 92626 | NA |
| 75300 | Imperial TIE Fighter | 39.99 | 260 | Retail | LEGO® Store The Shops At Mission Viejo | FALSE | Out of Stock | 555 The Shops at Mission Viejo, The Shops at Mission Viejo, Space 428b, Mission Viejo, CA 92691 | NA |
| 75300 | Imperial TIE Fighter | 39.99 | 260 | Retail | LEGO® Store Downtown Disney® District | FALSE | Out of Stock | 1585 Disneyland Drive, Anaheim, CA 92802 | NA |

I can filter this table to show only LEGO sets that are in stock at their respective locations.

products_df %>%
  filter(INSTOCK)
| ITEM | NAME | PRICE | VIP | TYPE | SOURCE | INSTOCK | DETAILS | ADDRESS | TIMESTAMP |
|---|---|---|---|---|---|---|---|---|---|
| 75280 | 501st Legion Clone Troopers | 29.99 | 195 | Online | LEGO.com | TRUE | Available now | https://www.lego.com/en-us/product/501st-legion-clone-troopers-75280 | 2021-01-31 19:04:06 |
| 75280 | 501st Legion Clone Troopers | 29.99 | 195 | Retail | LEGO® Store The Shops At Mission Viejo | TRUE | In Stock at this time | 555 The Shops at Mission Viejo, The Shops at Mission Viejo, Space 428b, Mission Viejo, CA 92691 | 2021-01-31 19:04:08 |
| 75280 | 501st Legion Clone Troopers | 29.99 | 195 | Retail | LEGO® Store Downtown Disney® District | TRUE | In Stock at this time | 1585 Disneyland Drive, Anaheim, CA 92802 | 2021-01-31 19:04:09 |
| 75301 | Luke Skywalker’s X-Wing Fighter | 49.99 | 325 | Online | LEGO.com | TRUE | Backorders accepted, will ship in 60 days | https://www.lego.com/en-us/product/luke-skywalker-s-x-wing-fighter-75301 | 2021-01-31 19:04:15 |
| 75300 | Imperial TIE Fighter | 39.99 | 260 | Online | LEGO.com | TRUE | Available now | https://www.lego.com/en-us/product/imperial-tie-fighter-75300 | 2021-01-31 19:04:23 |

Due to restrictions from the pandemic, some may not want to or be able to go to physical LEGO stores. I can filter this data to get all LEGO sets that are in stock online.

products_df %>%
  filter(INSTOCK, TYPE=="Online")
| ITEM | NAME | PRICE | VIP | TYPE | SOURCE | INSTOCK | DETAILS | ADDRESS | TIMESTAMP |
|---|---|---|---|---|---|---|---|---|---|
| 75280 | 501st Legion Clone Troopers | 29.99 | 195 | Online | LEGO.com | TRUE | Available now | https://www.lego.com/en-us/product/501st-legion-clone-troopers-75280 | 2021-01-31 19:04:06 |
| 75301 | Luke Skywalker’s X-Wing Fighter | 49.99 | 325 | Online | LEGO.com | TRUE | Backorders accepted, will ship in 60 days | https://www.lego.com/en-us/product/luke-skywalker-s-x-wing-fighter-75301 | 2021-01-31 19:04:15 |
| 75300 | Imperial TIE Fighter | 39.99 | 260 | Online | LEGO.com | TRUE | Available now | https://www.lego.com/en-us/product/imperial-tie-fighter-75300 | 2021-01-31 19:04:23 |

Once I have the filtered data I want, I can save it to a CSV file to keep a copy.
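Saving could look something like the sketch below. It uses a small stand-in for products_df, base R subsetting in place of filter so that it runs without dplyr, and a temporary file; the file name lego_in_stock.csv is an assumption, so substitute whatever path suits your setup.

```r
# Small stand-in for products_df with a subset of its columns
products_df <- data.frame(
  ITEM = c("75280", "75301"),
  TYPE = c("Online", "Retail"),
  INSTOCK = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Keep only the rows that are in stock (base R equivalent of filter(INSTOCK))
in_stock_df <- products_df[products_df$INSTOCK, , drop = FALSE]

# Hypothetical output location; a temporary file keeps the example tidy
csv_path <- file.path(tempdir(), "lego_in_stock.csv")
write.csv(in_stock_df, csv_path, row.names = FALSE)

# Read the file back to confirm what was written
saved_df <- read.csv(csv_path, stringsAsFactors = FALSE)
```

Setting row.names = FALSE keeps R's row numbers out of the file, which makes it easier to open in a spreadsheet.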

Final Thoughts

I hope you found this helpful with whatever project you’re working on. I recommend automating this script with Task Scheduler (on Windows) so that you really don’t have to do any work to check the inventory status for your favorite LEGO sets; at any point in time, you could open the CSV file and know which sets are in stock. Going one step further, you can turn this script into some sort of bot to notify you when a LEGO set is in stock. There are certainly more ways to optimize this process, but this code should provide at least a good starting point with regard to data acquisition.
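As a rough starting point for the notification idea, the sketch below compares two snapshots and reports items that have just come into stock. NewlyInStock is a hypothetical helper, the snapshots are made up, and message() stands in for whatever notification channel you prefer.

```r
# Hypothetical helper: rows that are in stock now but were not in stock
# (or not present at all) in the previous snapshot
NewlyInStock <- function(previous, current) {
  MakeKey <- function(df) paste(df$ITEM, df$SOURCE, sep = "|")
  previously_available <- MakeKey(previous)[previous$INSTOCK]
  is_new <- current$INSTOCK & !(MakeKey(current) %in% previously_available)
  current[is_new, , drop = FALSE]
}

# Made-up snapshots from two consecutive runs
previous <- data.frame(
  ITEM = c("75280", "75301"),
  SOURCE = c("LEGO.com", "LEGO.com"),
  INSTOCK = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)
current <- data.frame(
  ITEM = c("75280", "75301"),
  SOURCE = c("LEGO.com", "LEGO.com"),
  INSTOCK = c(TRUE, TRUE),
  stringsAsFactors = FALSE
)

alerts <- NewlyInStock(previous, current)
if (nrow(alerts) > 0) {
  message("Now in stock: ", paste(alerts$ITEM, collapse = ", "))
}
```

In a scheduled run, the previous snapshot would presumably be read from the CSV file written by the last run, and the current one would come from MasterScraper.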

Since I created this script, I have already been able to snag a couple of LEGO sets online and at my local LEGO store. On some occasions, a LEGO set went out of stock within 24 hours of my purchasing it. Now I no longer have to worry about LEGO sets going out of stock or missing out on an opportunity, and I hope you won’t have to worry either.