Sheet Scraper
Project Overview
I wrote this program to help the FRC shirt trading community by making a spreadsheet that is a combination of members' spreadsheets. If you are not familiar with FRC or shirt trading, you probably did not understand a word of that sentence, so I will give a brief overview.
The code shown here has been modified to make it easier to explain. If you would like to check it out, the full code is available on GitHub: Sheet Scraper. If you would like to see the final result, here is a link to the spreadsheet.
FRC
First Robotics Competition is a high school competitive robotics league. It has around +3500 teams across +25 countries. Teams will design, build, wire, and program a robot to compete in 3v3 matches against an opposing team. The way you score points each year is different (Here is an example of the 2022 game).
Shirt Trading
FRC is modeled like conventional sports, teams have mascots and team shirts. FRC does differ from conventional sports because it promotes cooperation and gracious professionalism at competitions which results in collaboration between teams. One example of this collaboration is shirt trading. Teams will make extra shirts to trade at competitions or online. Shirts have different values depending on the team/s that are on them and how cool they are. There is a Discord server for shirt trading with channels for posting spreadsheets of shirt inventories.
Requirements
Now let's talk about the requirements to run this scraper:
Blue alliance is used to get the name of a team when only the number has been provided
A discord bot is used to scrape the spreadsheet data from the FRC Shirt Trading discord server
Used to get data from the spreadsheets and to write data into the combined spreadsheet
Getting Links From Discord
Text Channel
The first step in creating a combined spreadsheet is getting all of the individual sheets:
def retrieve_messages(channelID):
headers = {
"authorization": "Bot " + str(keys.DISCORD_AUTH)
}
r = requests.get(f"https://discord.com/api/v9/channels/{channelID}/messages?limit=100", headers=headers)
data = json.loads(r.text)
return data
def get_sheet_id(data):
ids = []
user = []
for value in data:
try:
id = value["content"].split("/d/")
id[1] = id[1].split("/edit")
ids.append(id[1][0])
user.append(value['author']['username'])
except:
...
return (ids, user)
def get_ids():
return get_sheet_id(retrieve_messages(keys.CHANNEL_ID))
Let's start with retrieve_messages
. We pass in the Discord channel we want to get the messages of and using requests we get the channel data from the Discord API. Next, we send all of the channel data to get_sheet_id
where we loop through every message and extract the spreadsheet ID if it exists.
Forums Channel
Recently the FRC Trading Discord has updated their spreadsheet channel to convert it into a forums channel, so our code needs to reflect the change:
client = discord.Client(intents=discord.Intents.default())
messages = []
users = []
@client.event
async def on_ready():
firstTime = True
for guild in client.guilds:
for channel in guild.channels:
if channel.name == "spreadsheets":
if firstTime:
firstTime = False
continue
for thread in channel.threads:
async for message in thread.history(limit=100):
if "spreadsheets/d/" in message.content:
id = message.content.split("/d/")
id[1] = id[1].split("/edit")
messages.append(id[1][0])
users.append(message.author.name)
break
break
await client.close()
def get_forum_ids():
client.run(keys.DISCORD_AUTH)
return (messages, users)
To get the spreadsheet IDs from a forums channel we need to access it with Discord's Python library. Here we create the discord client and when we get the IDs from get_forum_ids
we run a discord bot to get every post from the forums channel and any spreadsheet IDs they contain.
Getting the Spreadsheets
We use the Google Sheets API to get spreadsheet data.
def get_sheet(id):
result = service.spreadsheets().get(spreadsheetId=id, includeGridData=True).execute()
result = service.spreadsheets().values().get(spreadsheetId=id, range=result['sheets'][0]['properties']['title'], majorDimension="ROWS").execute()
if "values" in result:
return result["values"]
else:
return -1
We use the first request to get the title of the first spreadsheet and the second request to get the contents with a check to make sure that the contents exist.
Parsing The Spreadsheet
Our first step is to take a spreadsheet and locate the different categories.
In this example, we can see we have categories for team number, team name, shirt type, shirt size, and how tradable it is. Now let's find each of those categories:
start = -1
for _ in range(10):
category_locations = {category:-1 for category in CATEGORIES}
for row in range(start + 1, len(sheet)):
for col in range(len(sheet[row])):
if check_category(sheet[row][col]) != "none" and category_locations[check_category(sheet[row][col])] == -1:
category_locations[check_category(sheet[row][col])] = col
start = row
if start == row: # Category's need to be in a row
break
categoriesFound = sum(1 for _, category in category_locations.items() if category != -1)
if categoriesFound >= MIN_CATEGORIES:
break
Let's break it down. First, we initialize a start
variable that we will use to keep track of where the spreadsheet starts.
category_locations = {category:-1 for category in CATEGORIES}
for row in range(start + 1, len(sheet)):
for col in range(len(sheet[row])):
We create an empty dictionary to store the column numbers for each category. Next, we start to loop through every row of the spreadsheet but first, we need to figure out how to find out if a current index contains a category.
def check_category(val):
if val == "":
return "none"
val = val.lower()
if "name" in val and "username" not in val:
return "Name"
if "number" in val or ("#" in val and "team" in val):
return "Number"
if "size" in val:
return "Size"
if "type" in val or "item" in val:
return "Type"
if "year" in val:
return "Year"
if "description" in val or "info" in val or "details" in val or "decription" in val:
return "Description"
if "status" in val or "avail" in val:
return "Availability"
if "trad" in val or "rarity" in val or "likeliness" in val:
return "Tradability"
if "notes" in val or "other" in val or "comments" in val:
return "Notes"
if "team" in val:
return "Number"
return "none"
Here is the checkCategory
method we use to look for keywords in each index of the spreadsheet. Now let's look at how we loop through each row.
if check_category(sheet[row][col]) != "none" and category_locations[check_category(sheet[row][col])] == -1:
category_locations[check_category(sheet[row][col])] = col
start = row
At each index, we check to see if it contains a category by passing it into our check_category
method as long as it does not return "none"
it is a valid category. We also check to make sure that this category has not already been found. After those conditions have been met we save the category's column number and the current row number.
for row in range(start + 1, len(sheet)):
for col in range(len(sheet[row])):
if start == row:
break
After we are done looping through the row we check to see if the start has been reassigned which means that a row containing categories has been found. In every spreadsheet the categories are always in a row so there is no reason to continue looking. But what if the sheet has a header that accidentally contains a "category"?
categoriesFound = sum(1 for _, category in category_locations.items() if category != -1)
if categoriesFound >= MIN_CATEGORIES:
break
We use this to make sure that the category row we have found is the correct one. We add up all of the categories that don't have a default column and check to make sure we have found enough MIN_CATEGORIES = 3
.
for _ in range(10):
We loop through this 10 times so that we don't waste time on sheets that don't contain any categories. After we loop through it 10 times we rotate the sheet 90 degrees sheet = list(zip(*sheet[::-1]))
and loop through it another 10 times because some spreadsheets have their categories horizontal.
end = -1
for row in range(len(sheet)):
for col in range(len(sheet[row])):
val = str(sheet[row][col]).lower()
if "wish" in val or "wants" in val or "looking" in val:
end = row if all([col <= index for _, index in category_locations.items()]) else -1
break
if end != -1:
break
Finally, we want to locate the end of the spreadsheet. Sometimes users will put a wishlist beneath their inventory so we use this to make sure we don't get the inventory shirts mixed up with the wishlist. all([col <= index for _, index in category_locations.items()]) else -1
is used because sometimes the wishlist is to the right of the regular inventory and we don't want to end too soon. Wishlists are also part of the reason why we check if the category has not already been found category_locations[check_category(sheet[row][col])] == -1
.
Parsing the Inventories
Now that we know where each category is we can get all of the shirt data.
def parse_sheet(sheet, start, end, user, id, category_locations):
shirts = []
empty = 0
for row in range(start+1, len(sheet) if end == -1 else end):
shirt = {category:"" for category in CATEGORIES}
shirt["User"] = user
shirt["ID"] = id
for category, col in category_locations.items():
if col == -1 or col >= len(sheet[row]):
continue
shirt[category] = sheet[row][col]
if shirt["Name"] == "" and shirt["Number"] != "":
try:
shirt["Name"] = get_team_name(int(shirt["Number"]))
except:
...
numNotEmpty = sum(1 for _, col in shirt.items() if col != '')
if numNotEmpty >= MIN_SHIRT_DATA:
shirts.append(shirt)
empty = 0
empty += 1
if empty > MAX_EMPTY:
break
return shirts
First, we initialize shirts
and empty
which will keep track of each shirt we find and the number of consecutive empty rows respectively.
for row in range(start+1, len(sheet) if end == -1 else end):
shirt = {category:"" for category in CATEGORIES}
shirt["User"] = user
shirt["ID"] = id
Next, we loop through each row of the sheet from start
to end
(if it is not -1). We also initialize shirt
to be a dictionary of each category and we set the user and id categories.
for category, col in category_locations.items():
if col == -1 or col >= len(sheet[row]):
continue
shirt[category] = sheet[row][col]
In this section, we are looping through the category_locations
dictionary which contains each category and what column it is in. We check to make sure that the category has been found and that it is not out of range. Then we get the category value and assign it to the shirt
dictionary at the category index.
if shirt["Name"] == "" and shirt["Number"] != "":
try:
shirt["Name"] = get_team_name(int(shirt["Number"]))
except:
...
Here we check to see if we have a team number but no team name. We then try to typecast the shirt number to make sure it is of type int
. If the shirt number is an int then we can find the team name using the Blue Alliance API. The Blue Alliance API is an API from a website that displays the FRC team's info, rankings, and match record. We can use this to fill in the blanks when we are missing a team name.
def get_team_name(number):
try:
name = team_names[str(number)]
return name
except:
...
site = "https://www.thebluealliance.com/api/v3/team/frc"
api = {"X-TBA-Auth-Key": keys.BLUE_ALLIANCE}
request = requests.get(url=site+str(number), headers=api)
team_names[str(number)] = request.json()['nickname']
return request.json()['nickname']
Here we use a try catch loop to see if we already have the name cached. If we don't then we make a call to the API using the team number and then access the nickname from the JSON.
numNotEmpty = sum(1 for _, col in shirt.items() if col != '')
if numNotEmpty >= MIN_SHIRT_DATA:
shirts.append(shirt)
empty = 0
empty += 1
if empty > MAX_EMPTY:
break
Finally, before we append the shirt
dictionary to shirts
we need to verify it. We count up how many empty categories we have and if it i greater than the min MIN_SHIRT_DATA = 5
we add it to shirts
. If it is empty we increment empty
and if there are too many empty shirts in a row then we stop parsing the spreadsheet MAX_EMPTY = 10
.
Sorting
Once we have a complete list of all of the shirts we, need to sort them.
def sort_sheet(shirts):
nonInt = {}
currentInt = 999999999
for row in shirts:
try:
row[1] = int(row[1])
except:
nonInt[str(currentInt)] = row[1]
row[1] = currentInt
currentInt += 1
shirts.sort(key=lambda x: int(x[1]))
for row in shirts:
try:
if row[1] >= 999999999:
row[1] = nonInt[str(row[1])]
except Exception as e:
print(str(e))
return shirts
Because there is no guarantee that the number column of the spreadsheet is a valid integer we need to reassign it so we can sort it numerically by team number.
for row in shirts:
try:
row[1] = int(row[1])
except:
nonInt[str(currentInt)] = row[1]
row[1] = currentInt
currentInt += 1
Here we check to see if the current index is an integer, if not we assign it a unique value so we can sort everything later.
shirts.sort(key=lambda x: int(x[1]))
for row in shirts:
try:
if row[1] >= 999999999:
row[1] = nonInt[str(row[1])]
except Exception as e:
print(str(e))
Now we sort the shirts and reassign them using the placeholder values.
Saving Results
Now that we have a complete sorted list of every shirt we can save them to our results spreadsheet.
def write_result(sheet, id):
CATEGORIES.append("User")
CATEGORIES.append("ID")
sheet.insert(0, CATEGORIES)
values = {"majorDimension": "ROWS", "range": "Sheet1", "values": sheet}
result = service.spreadsheets().values().update(spreadsheetId=id, range="Sheet1", valueInputOption="RAW", body=values).execute()
format = {'requests': [
{'updateSheetProperties': {
'properties': {'gridProperties': {'frozenRowCount': 1}},
'fields': 'gridProperties.frozenRowCount',
}},
{'repeatCell': {
'range': {'endRowIndex': 1},
'cell': {'userEnteredFormat': {'textFormat': {'bold': True}}},
'fields': 'userEnteredFormat.textFormat.bold',
}},
]}
service.spreadsheets().batchUpdate(spreadsheetId=id, body=format).execute()
return result
Here we add User
and ID
to the categories and update the results sheet. We then format the sheet so the category row is frozen and bold.
Final Thoughts
This was a fun project. There was definitely some difficulty with creating algorithms to work for any spreadsheet as there was a lot of variety between them. Overall I had fun creating the algorithms and I hope you had fun reading about it.
BONUS counting the number of occurrences each shirt has
We are not done yet. Let's calculate the number of times each shirt appears in the result spreadsheet.
sheet = get_sheet(RESULT)
numberOfShirts = {}
del sheet[0]
for shirt in sheet:
if str(shirt[1]) in numberOfShirts:
numberOfShirts[str(shirt[1])] += 1
else:
numberOfShirts[str(shirt[1])] = 1
sheet = [[team, num] for team, num in numberOfShirts.items()]
values = {"majorDimension": "ROWS", "range": "Count", "values": sheet}
result = service.spreadsheets().values().update(spreadsheetId=RESULT, range="Count", valueInputOption="RAW", body=values).execute()
First, we get the data from the result spreadsheet. Next loop through each index counting the number of times each shirt occurs. Then we convert it to a list and add the result to the Count
page of the final spreadsheet.
Full code: Sheet Scraper Result: Spreadsheet