Introduction
When I was a kid, I went through the same LEGO phase as most other kids. LEGOs were always on my birthday and Christmas wish lists. I loved coming home from school and finding the monthly LEGO catalog in the mail, where I would spend the next hour scanning each page for new sets that piqued my interest. Only the best sets would make it onto my birthday and Christmas wish lists, and it was up to my parents to figure out the logistics of getting these seemingly high-demand sets. While I did not receive every LEGO set I asked for, I still have fond memories building and playing with the ones I did receive.
Like most kids, I slowly grew out of LEGO. Nothing changed with LEGO; I was just getting older and no longer a part of the target demographic (it’s not you, it’s me). However, after many years and a degree in mechanical engineering, I have come back to LEGO. I have a new appreciation for the creative and advanced building techniques while also having a soft spot for the nostalgia of opening a box and building my toy. But now that I’m older, I understand the trouble my parents went through in finding some of these LEGO sets.
The final season of the animated TV show Star Wars: The Clone Wars aired during the beginning of the COVID-19 pandemic, and the ending was nothing short of spectacular. Some of the characters from the series were at their peak in popularity, and LEGO capitalized on this opportunity by releasing a 501st battle pack and an AAT featuring Ahsoka Tano and a 332nd clone trooper just a few months after the final episode aired. These sets were so popular among Star Wars fans that they were nearly impossible to get for several months. In fact, LEGO had to impose a limit of one-per-person due to the insane demand.
This frenzy is similar to the PlayStation 5 release in that it was/is nearly impossible to get your hands on one. Some very determined people used bots to purchase a PlayStation 5 whenever it was in stock so that they didn’t have to check multiple times a day for an inventory update. This inspired me to create something similar that would check online and in-store inventory for LEGO sets I wanted, and I believe I was able to do just that (I didn’t want to automatically purchase LEGO sets as I feel that would be unethical).
This blog serves as a walkthrough on how to get inventory status data for LEGO sets without visiting LEGO.com.
Functions
The vast majority of code in the script is for functions. In this section, I will outline each function and describe their intended purpose.
XPathPartialClass
All data from a webpage is stored in some container. Most of these containers have attributes that specify the design and format of the web element. Some of the names for these attributes can be quite lengthy and specific, which can be problematic when trying to maintain a robust script. The XPathPartialClass
function gives me the option to only provide as much of the attribute name as I want without using the whole name.
For example, one of the class names on the LEGO website is StoreCheckerstyles__StoreSelected-kr3ej5-3 ejSejf. Instead of searching all nodes for that string, I can simply enter StoreSelected as an argument in the XPathPartialClass
function. This will get me what I want (assuming StoreSelected is unique) without relying on the seemingly random alphanumeric characters in the class name.
XPathPartialClass <- function(name, container = "div", attribute = "class") {
return(
paste0(
'//',
container,
'[@',
attribute,
' and contains(concat(" ", normalize-space(@',
attribute,
'), " "),"',
name,
'")]'
)
)
}
ResetPorts
This script will throw an error if the remote driver attempts to occupy a port that is already in use. ResetPorts
frees up the ports to minimize this error.
ResetPorts <- function() {
invisible(capture.output(
system(
"taskkill /im java.exe /f",
intern = FALSE,
ignore.stdout = FALSE,
show.output.on.console = FALSE
)
))
}
ProductOnline
Any data regarding item information–including its online availability–can be accessed without physically visiting LEGO.com. This means that a remote driver is not needed to run this function, so if you only care about checking a LEGO set’s inventory status on the online store, you only need to run ProductOnline
. All you need to provide is the set URL.
ProductOnline <- function(link) {
...
}
All of the data is JSON-formatted in the HTML under a script container. This part of the function retrieves the data and converts it to a data frame
scripts <- link %>%
read_html() %>%
html_nodes("script")
for (j in 1:length(scripts)) {
extracted_json <- scripts %>%
.[j] %>%
html_text(trim = TRUE)
if (grepl("^window.__", extracted_json)) {
break
}
}
object_json <- extracted_json %>%
gsub("^window.*window.*__=|[;]$", "", .) %>%
fromJSON(simplifyDataFrame = TRUE) %>%
{.$apolloState$data}
The next several lines retrieve only the pieces of data I deem necessary. There is a lot more data in the object_json
variable, but this is all I would theoretically need as a LEGO customer.
avl_online <- object_json %>%
{.[grepl("ProductVariant.*attributes", names(.))][[1]]} %>%
.[["canAddToBag"]]
avl_details <- object_json %>%
{.[grepl("ProductVariant.*attributes", names(.))][[1]]} %>%
.[["availabilityText"]]
order_limit <- object_json %>%
{.[grepl("ProductVariant.*attributes", names(.))][[1]]} %>%
.[["maxOrderQuantity"]]
price <- object_json %>%
{.[grepl("ProductVariant.*listPrice", names(.))][[1]]} %>%
.[["formattedValue"]]
vip_pts <- object_json %>%
{.[grepl("^ProductVariant", names(.))][[1]]} %>%
.[["vipPoints"]]
product_name <- object_json %>%
{.[grepl("SingleVariantProduct", names(.))][[1]]} %>%
.[["name"]] %>%
gsub("\231","",.) # removes TM symbol
product_code <- object_json %>%
{.[grepl("SingleVariantProduct", names(.))][[1]]} %>%
.[["productCode"]]
The final part of the function is to return all the data as a data frame. I also include the item location type, the the location itself, and the time I retrieved the data.
return(
data.frame(
"ITEM" = product_code,
"NAME" = product_name,
"PRICE" = price,
"VIP" = vip_pts,
"TYPE" = "Online",
"SOURCE" = "LEGO.com",
"INSTOCK" = avl_online,
"DETAILS" = avl_details,
"ADDRESS" = link,
"TIMESTAMP" = if_else(avl_online, Sys.time(), as.POSIXct(NA))
)
)
StoreStatus
This function grabs the inventory status and other information for each store and returns the data in the form of a data frame. It is very similar to the ProductOnline
function, but we have to use tools from RSelenium
to get in-store stock details.
StoreStatus <- function() {
store_name <- remDr$findElement(using = "xpath", '//div[@data-test = "store-inventory-preview-name"]')$
getElementText() %>% unlist()
store_status <- remDr$findElement(using = "xpath", '//div[@data-test = "store-inventory-preview-status"]')$
getElementText() %>% unlist()
store_address <- remDr$findElement(using = "xpath", '//div[@data-test = "store-inventory-preview-address"]')$
getElementText() %>% unlist()
store_bool <- grepl("in stock", store_status, ignore.case = TRUE)
return(
data.frame(
"TYPE" = "Retail",
"SOURCE" = store_name,
"INSTOCK" = store_bool,
"DETAILS" = store_status,
"ADDRESS" = store_address,
"TIMESTAMP" = if_else(store_bool, Sys.time(), as.POSIXct(NA))
)
)
}
StoreData
This function cycles through a given number of LEGO stores and retrieves each store’s inventory status and information using the StoreData
function. Note that the while loop will not run if only one store’s data is requested.
StoreData <- function(store_num) {
store_i <- 1
df <- StoreStatus()
while (store_i < store_num) {
Sys.sleep(0.5)
remDr$findElement(using = "xpath", '//div[@data-test = "store-inventory-selected-store"]')$clickElement()
Sys.sleep(0.5)
remDr$findElement(
using = "xpath",
paste0('//div[@data-test = "store-inventory-store-unselected"][',store_i,']')
)$clickElement()
store_i <- store_i + 1
df <- rbind(df, StoreStatus())
}
return(df)
}
TotalData
When visiting a LEGO set’s webpage, TotalData
uses both ProductOnline
and StoreData
to get the set’s inventory status from the online store and the nearest LEGO stores.
TotalData <- function(link, store_num) {
df <- bind_rows(ProductOnline(link), StoreData(store_num))
df[c("ITEM", "NAME", "PRICE", "VIP")] <- df[1, c("ITEM", "NAME", "PRICE", "VIP")]
return(df)
}
MasterScraper
This is the function that combines everything I need from LEGO.com into one function. It creates a remote driver to mimic a physical interaction with LEGO.com without the user doing any work. The only arguments for MasterScraper
are an array of URLs for each LEGO set, my zip code, and the number of nearest LEGO stores I want to search. Note that I have set the default number of nearest stores to three.
MasterScraper <- function(links, zip_code, store_num=3) {
...
}
The first thing the MasterScraper
function does is clear any existing instances of the remote driver from the previous run by calling ResetPorts
.
ResetPorts()
The next step is to initialize and check some varaibles. Along with creating an indexing variable, the number of stores may be corrected if the user inputted a number greater than five. This is because LEGO.com only shows the five closest stores for a given zip code. Also, LEGO stores tend to be so spread out that the fifth-closest store could easily be over 100 miles away. MasterScraper
may take a few minutes to run depending on the number of LEGO sets the user is searching for, so a progress bar is nice to have.
link_i <- 1
if (store_num > 5) {store_num <- 5}
pb <- txtProgressBar(min=0, max=length(links), style=3)
The rsDriver
function from RSelenium
creates the remote driver for our scraping needs. The final argument makes the remote driver headless, which means I won’t see it running. When debugging, I recommend commenting the final argument out so that you can see how your code is interacting with the browser.
rD <- rsDriver(
browser = "chrome",
chromever = "87.0.4280.88",
verbose = FALSE,
extraCapabilities = list("chromeOptions" = list(args = list('--headless')))
)
remDr <<- rD$client
Now that the remote driver has been created, I can start retrieving data on my LEGO sets.
remDr$navigate(links[link_i])
Sys.sleep(2)
Upon visiting LEGO.com, a pop-up appears asking the if user wants to continue to the shopping website or the play zone.
These next two lines will click the Continue button as well as the Accept Cookies button behind the popup to get to the LEGO set webpage.
remDr$findElement(using = "xpath", '//button[@data-test = "age-gate-grown-up-cta"]')$clickElement()
remDr$findElement(using = "xpath", '//button[@data-test = "cookie-banner-normal-button"]')$clickElement()
This part of the function will check the LEGO set’s inventory status at the nearest LEGO store locations based on the given zip code. I am very fortunate to work relatively close to three LEGO stores, so MasterScraper
will retrieve inventory data from those three locations as well as its online availability using the TotalData
function.
remDr$findElement(using = "xpath", XPathPartialClass("stock-accordion", "button", "data-test"))$clickElement()
remDr$findElement(
using = "xpath",
'//input[@data-test = "input-with-button-input"]'
)$sendKeysToElement(list(zip_code))
remDr$findElement(using = "xpath", '//button[@data-test = "input-with-button-button"]')$clickElement()
Sys.sleep(0.5)
df <- TotalData(links[link_i], store_num)
setTxtProgressBar(pb, link_i)
If there are multiple LEGO sets, the data retrieval process is repeated for the remaining sets.
while (link_i < length(links)) {
link_i <- link_i + 1
remDr$navigate(links[link_i])
Sys.sleep(0.5)
remDr$findElement(using = "xpath",
XPathPartialClass("stock-accordion", "button", "data-test"))$clickElement()
Sys.sleep(1.0)
df <- rbind(df, TotalData(links[link_i], store_num))
setTxtProgressBar(pb, link_i)
}
Now that all of the data has been gathered, it is time to close the remote driver. This is very important so that stale instances are not left open over time.
remDr$quit()
Finally, the data frame is modified so that all factor columns are converted to character columns.
df <- df %>%
mutate_if(is.factor, as.character)
return(df)
Main Code
Thanks to all of the functions, the rest of the code takes up only a few lines.
Data Acquisition
For this code demonstration, I will retrieve inventory data for three LEGO sets. These sets include the very popular 501st battle pack from August 2020, the new X-Wing from January 2021, and the new TIE Fighter also from January 2021. As previously mentioned, the 501st battle pack was a very tough set to get for several months following its release; while the X-Wing and TIE Fighter don’t seem to have the same level of demand and fandom as the 501st battle pack, they can still be hard to find in stores.
Once I add the LEGO set URLs along with my zip code and the number of stores to search for into MasterScraper
, the function will run and return the product data.
links <- c(
"https://www.lego.com/en-us/product/501st-legion-clone-troopers-75280",
"https://www.lego.com/en-us/product/luke-skywalker-s-x-wing-fighter-75301",
"https://www.lego.com/en-us/product/imperial-tie-fighter-75300"
)
products_df <- MasterScraper(links, zip_code = "92612", store_num = 3)
That’s it! I now have inventory status data for the 501st battle pack, the X-Wing, and the TIE Fighter.
Data Filtering
Here is a look at the data I have gathered from LEGO.com.
products_df
ITEM
|
NAME
|
PRICE
|
VIP
|
TYPE
|
SOURCE
|
INSTOCK
|
DETAILS
|
ADDRESS
|
TIMESTAMP
|
75280
|
501st Legion Clone Troopers
|
29.99
|
195
|
Online
|
LEGO.com
|
TRUE
|
Available now
|
https://www.lego.com/en-us/product/501st-legion-clone-troopers-75280
|
2021-01-31 19:04:06
|
75280
|
501st Legion Clone Troopers
|
29.99
|
195
|
Retail
|
LEGO® Store South Coast Plaza
|
FALSE
|
Out of Stock
|
South Coast Plaza, 3333 Bristol Street #1042, Costa Mesa, CA 92626
|
NA
|
75280
|
501st Legion Clone Troopers
|
29.99
|
195
|
Retail
|
LEGO® Store The Shops At Mission Viejo
|
TRUE
|
In Stock at this time
|
555 The Shops at Mission Viejo, The Shops at Mission Viejo, Space 428b, Mission Viejo, CA 92691
|
2021-01-31 19:04:08
|
75280
|
501st Legion Clone Troopers
|
29.99
|
195
|
Retail
|
LEGO® Store Downtown Disney® District
|
TRUE
|
In Stock at this time
|
1585 Disneyland Drive, Anaheim, CA 92802
|
2021-01-31 19:04:09
|
75301
|
Luke Skywalker’s X-Wing Fighter
|
49.99
|
325
|
Online
|
LEGO.com
|
TRUE
|
Backorders accepted, will ship in 60 days
|
https://www.lego.com/en-us/product/luke-skywalker-s-x-wing-fighter-75301
|
2021-01-31 19:04:15
|
75301
|
Luke Skywalker’s X-Wing Fighter
|
49.99
|
325
|
Retail
|
LEGO® Store South Coast Plaza
|
FALSE
|
Out of Stock
|
South Coast Plaza, 3333 Bristol Street #1042, Costa Mesa, CA 92626
|
NA
|
75301
|
Luke Skywalker’s X-Wing Fighter
|
49.99
|
325
|
Retail
|
LEGO® Store The Shops At Mission Viejo
|
FALSE
|
Out of Stock
|
555 The Shops at Mission Viejo, The Shops at Mission Viejo, Space 428b, Mission Viejo, CA 92691
|
NA
|
75301
|
Luke Skywalker’s X-Wing Fighter
|
49.99
|
325
|
Retail
|
LEGO® Store Downtown Disney® District
|
FALSE
|
Out of Stock
|
1585 Disneyland Drive, Anaheim, CA 92802
|
NA
|
75300
|
Imperial TIE Fighter
|
39.99
|
260
|
Online
|
LEGO.com
|
TRUE
|
Available now
|
https://www.lego.com/en-us/product/imperial-tie-fighter-75300
|
2021-01-31 19:04:23
|
75300
|
Imperial TIE Fighter
|
39.99
|
260
|
Retail
|
LEGO® Store South Coast Plaza
|
FALSE
|
Out of Stock
|
South Coast Plaza, 3333 Bristol Street #1042, Costa Mesa, CA 92626
|
NA
|
75300
|
Imperial TIE Fighter
|
39.99
|
260
|
Retail
|
LEGO® Store The Shops At Mission Viejo
|
FALSE
|
Out of Stock
|
555 The Shops at Mission Viejo, The Shops at Mission Viejo, Space 428b, Mission Viejo, CA 92691
|
NA
|
75300
|
Imperial TIE Fighter
|
39.99
|
260
|
Retail
|
LEGO® Store Downtown Disney® District
|
FALSE
|
Out of Stock
|
1585 Disneyland Drive, Anaheim, CA 92802
|
NA
|
I can filter this table to show only LEGO sets that are in stock at their respective locations.
products_df %>%
filter(INSTOCK)
Due to restrictions from the pandemic, some may not want to or be able to go to physical LEGO stores. I can filter this data to get all LEGO sets that are in stock online.
products_df %>%
filter(INSTOCK, TYPE=="Online")
Once I get the desired filtered data, I can save it to a csv file and have a copy of the data.