My Data Science Diary: Scraping and visualizing FIFA men's ranking data

The FIFA men's world ranking is a system for ranking men's national teams in association football. FIFA used to publish an updated ranking every month, but since 2022 it has published one every two months. This project gives step-by-step instructions for scraping the ranking data with the Selenium Python package and visualizing it in Power BI.

1. Dataset introduction

The FIFA ranking data covers the period from 31 Dec 1992 to 22 Dec 2022. The ranking table is paginated, so only 50 teams are shown per page. Each ranking period lists more than 200 teams.

Example of FIFA men's ranking table

Each ranking period has its own URL, formatted as shown below. The ID number runs from 1 for 31 Dec 1992 to 152 for 17 Jan 2007; after that, the IDs appear arbitrary and no longer follow a predictable sequence.

https://www.fifa.com/fifa-world-ranking/men?dateId=ID

22 Dec 2022: https://www.fifa.com/fifa-world-ranking/men?dateId=id13869
06 Oct 2022: https://www.fifa.com/fifa-world-ranking/men?dateId=id13792
...
17 Jan 2007: https://www.fifa.com/fifa-world-ranking/men?dateId=id152
18 Dec 2006: https://www.fifa.com/fifa-world-ranking/men?dateId=id151
...
31 Dec 1992: https://www.fifa.com/fifa-world-ranking/men?dateId=id1
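
As a minimal sketch, the URL for one ranking period can be assembled from its ID. The helper name ranking_url is hypothetical, and the "id" prefix is taken from the example URLs above.

```python
BASE_URL = "https://www.fifa.com/fifa-world-ranking/men"


def ranking_url(date_id: str) -> str:
    """Build the ranking page URL for one period, e.g. date_id="id13869"."""
    return f"{BASE_URL}?dateId={date_id}"


print(ranking_url("id13869"))
# prints https://www.fifa.com/fifa-world-ranking/men?dateId=id13869
```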

2. Web scraping

The Selenium Python package is used to automate a web browser (I used Chrome for this project) and interact with it to scrape the data.

Scraping process

Retrieve the list of ID numbers for all ranking periods
Reload the current page with the new ranking period
For each ranking period, scrape the team data from the current page, then navigate to the next pages
Append the results to the main dataset
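
The four steps above can be sketched as a single loop. This is only an outline under assumptions: scrape_all_periods, open_period and scrape_period are hypothetical names, and the two callables stand in for the Selenium logic that loads a ranking period and collects its rows across all pages.

```python
def scrape_all_periods(date_ids, open_period, scrape_period):
    """Visit every ranking period and collect its rows into one dataset.

    date_ids      -- IDs retrieved in step 1, e.g. ["id13869", "id13792"]
    open_period   -- callable that loads the page for one ID (step 2)
    scrape_period -- callable returning all rows of the loaded period (step 3)
    """
    dataset = []
    for date_id in date_ids:
        open_period(date_id)
        for row in scrape_period():
            # Step 4: tag each row with its period and add it to the main dataset
            dataset.append({"date_id": date_id, "row": row})
    return dataset
```

Keeping the browser-specific parts behind two callables means the loop itself can be checked with simple stand-in functions before a real browser is involved.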

Analysis of the HTML structure

All ranking periods and their ID numbers are stored as JSON (key-value pairs) embedded in the script tag with the ID "__NEXT_DATA__". They can be scraped with the Selenium WebDriver find_element method.
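
A minimal sketch of pulling the period IDs out of that JSON follows. The exact key layout of the "__NEXT_DATA__" payload is not described here, so the hypothetical helper extract_date_ids avoids assuming any key names: it walks the whole structure and keeps every string shaped like "id" followed by digits, the shape seen in the example URLs.

```python
import json
import re

# Period IDs look like "id13869" in the example URLs
DATE_ID_PATTERN = re.compile(r"^id\d+$")


def extract_date_ids(next_data_text):
    """Return every distinct ID-shaped string found in the JSON text, in order."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif isinstance(node, str) and DATE_ID_PATTERN.match(node):
            if node not in found:
                found.append(node)

    walk(json.loads(next_data_text))
    return found
```

With a live driver, the raw text of the script tag could be read with driver.find_element(By.ID, "__NEXT_DATA__").get_attribute("innerHTML") and passed to this function. The element's text property is likely to be empty here, because Selenium only returns visible text and script tags are not rendered.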

Team data can be retrieved from the HTML table rows with the class "row_rankingTableFullRow__Y_A4i". This can be done with the find_element method and a CSS selector (or find_elements, to get every matching row at once).
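
As a sketch of that lookup, the hypothetical helper below collects the cell text of every ranking row on the current page. It assumes each row holds its values in td cells, which is not confirmed by the page description above, and it uses find_elements so that all rows are returned rather than only the first.

```python
ROW_SELECTOR = ".row_rankingTableFullRow__Y_A4i"


def read_team_rows(driver):
    """Return the current page's ranking rows as lists of cell text."""
    # "css selector" is the string value behind Selenium's By.CSS_SELECTOR
    rows = driver.find_elements("css selector", ROW_SELECTOR)
    return [
        [cell.text.strip() for cell in row.find_elements("css selector", "td")]
        for row in rows
    ]
```

Class names such as "row_rankingTableFullRow__Y_A4i" carry a generated suffix, so the selector may need updating if the site is rebuilt.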

Only the first 50 teams are displayed on the current page; users have to click the pagination control to view the next 50 teams. Navigation to the next page can therefore be automated using the same find_element method with an XPath expression.
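
The pagination loop might look like the sketch below. The function name scrape_all_pages, the read_page callable and the next_button_xpath argument are all hypothetical; the real XPath of the "next" control has to be taken from the page itself. The max_pages cap is a safety assumption so the loop cannot run forever.

```python
def scrape_all_pages(driver, read_page, next_button_xpath, max_pages=10):
    """Collect rows from the current page and every following page."""
    all_rows = []
    for _ in range(max_pages):
        all_rows.extend(read_page(driver))
        # "xpath" is the string value behind Selenium's By.XPATH
        buttons = driver.find_elements("xpath", next_button_xpath)
        if not buttons or not buttons[0].is_enabled():
            break  # no further page to visit
        buttons[0].click()
    return all_rows
```

In practice a short wait after each click may be needed so that the new rows have loaded before they are read.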

Example of scraped dataset as follows

Data cleaning

The scraped data is largely standardized and clean, apart from the following minor issues

The month of September is formatted as "DD Sept YYYY", which differs from the other periods, which use a three-letter month abbreviation (for example, "06 Oct 2022").
The names of several countries should be standardized. For example, "Côte d'Ivoire" and "Türkiye" can be replaced by "Ivory Coast" and "Turkey".
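
Both fixes can be sketched in a few lines of Python. The helper names clean_date and clean_country are hypothetical, and the date parsing assumes that every month other than September already uses a three-letter English abbreviation, as in "22 Dec 2022".

```python
from datetime import datetime

# Replacement names taken from the examples above; extend as needed
COUNTRY_NAMES = {
    "Côte d'Ivoire": "Ivory Coast",
    "Türkiye": "Turkey",
}


def clean_date(text):
    """Parse a ranking date such as "22 Dec 2022", tolerating "Sept"."""
    normalized = text.strip().replace("Sept ", "Sep ")
    return datetime.strptime(normalized, "%d %b %Y").date()


def clean_country(name):
    """Map a country name to its standardized form, if one is defined."""
    name = name.strip()
    return COUNTRY_NAMES.get(name, name)


print(clean_date("22 Dec 2022"))   # prints 2022-12-22
print(clean_country("Türkiye"))    # prints Turkey
```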

3. Data visualization

The scraped data is then imported into Power BI for analysis and visualization.

Example of data relationship