The FIFA men's world ranking is a system for ranking men's national teams in association football. FIFA used to publish an updated ranking every month, but since 2022 it has published one about every two months. This project walks step by step through scraping the ranking data with the Selenium Python package and visualizing it in Power BI.
1. Dataset introduction
The FIFA ranking data covers the period from 31 Dec 1992 to 22 Dec 2022. The ranking table is paginated, so only 50 teams are shown per page. Each ranking period contains roughly 200 teams or more.
Example of FIFA men's ranking table
Each ranking period has its own URL, formatted as shown below. The ID runs sequentially from id1 for 31 Dec 1992 to id152 for 17 Jan 2007; after that, the IDs no longer follow a predictable sequence.
https://www.fifa.com/fifa-world-ranking/men?dateId=ID
- 22 Dec 2022: https://www.fifa.com/fifa-world-ranking/men?dateId=id13869
- 06 Oct 2022: https://www.fifa.com/fifa-world-ranking/men?dateId=id13792
- ...
- 17 Jan 2007: https://www.fifa.com/fifa-world-ranking/men?dateId=id152
- 18 Dec 2006: https://www.fifa.com/fifa-world-ranking/men?dateId=id151
- ...
- 31 Dec 1992: https://www.fifa.com/fifa-world-ranking/men?dateId=id1
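To make the URL pattern concrete, here is a minimal sketch that builds the page address from a date ID. The names BASE_URL and build_ranking_url are made up for illustration, and the loop assumes that every number from 1 to 152 is actually in use, which the list above suggests but does not prove.

```python
BASE_URL = "https://www.fifa.com/fifa-world-ranking/men"


def build_ranking_url(date_id: str) -> str:
    """Return the ranking page URL for one period, e.g. date_id='id13869'."""
    return f"{BASE_URL}?dateId={date_id}"


# Assumption: the early periods use consecutive IDs, id1 through id152.
sequential_urls = [build_ranking_url(f"id{n}") for n in range(1, 153)]

print(sequential_urls[0])
print(sequential_urls[-1])
```

The later, non-sequential IDs cannot be generated this way and have to be read from the page itself.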
2. Web scraping
The Selenium Python package is used to automate a web browser (I used Chrome for this project) and interact with the pages in order to scrape the data.
Scraping process
- Retrieve the list of ID numbers for all ranking periods
- Reload the page with each ranking period in turn
- For each ranking period, scrape the team data from the current page, then navigate to the next pages
- Append the results to the main dataset
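The four steps above can be sketched as a single loop. This is only an outline under assumptions: scrape_all_periods and its three callables (load_period, scrape_page, go_to_next_page) are hypothetical names, and the browser-specific work is passed in as functions so that the control flow stays visible.

```python
from typing import Callable, Iterable


def scrape_all_periods(
    date_ids: Iterable[str],
    load_period: Callable[[str], None],
    scrape_page: Callable[[], list],
    go_to_next_page: Callable[[], bool],
) -> list:
    """Collect the rows of every page of every ranking period."""
    dataset = []
    for date_id in date_ids:
        load_period(date_id)  # open the page for this ranking period
        while True:
            # Tag each scraped row with the period it belongs to.
            for row in scrape_page():
                dataset.append({"date_id": date_id, **row})
            # Stop when there is no further page for this period.
            if not go_to_next_page():
                break
    return dataset
```

Here scrape_page is expected to return a list of dictionaries, one per team, and go_to_next_page to return False once the last page has been reached.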
Analysis of the HTML structure
All ranking periods and their ID numbers are stored as JSON (key-value pairs) embedded in the script tag with the ID "__NEXT_DATA__". These periods can be scraped with the Selenium WebDriver find_element method.
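As a rough illustration of this step, the sketch below reads the script tag and pulls out every string that looks like a date ID. The layout of the JSON is not described here, so instead of guessing key names the helper walks the whole structure and keeps values matching the "id" plus digits pattern seen in the URLs; it may therefore pick up unrelated values of the same shape. The function names are hypothetical, and the locator strategy is passed as the plain string "id", which is the value behind Selenium's By.ID.

```python
import json
import re

DATE_ID_PATTERN = re.compile(r"id\d+")


def collect_date_ids(node, found=None):
    """Walk parsed JSON and collect strings such as 'id13869', in order, without duplicates."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for value in node.values():
            collect_date_ids(value, found)
    elif isinstance(node, list):
        for value in node:
            collect_date_ids(value, found)
    elif isinstance(node, str) and DATE_ID_PATTERN.fullmatch(node) and node not in found:
        found.append(node)
    return found


def read_date_ids(driver):
    """Read the __NEXT_DATA__ script tag from the loaded page and return the date IDs."""
    script = driver.find_element("id", "__NEXT_DATA__")  # "id" equals By.ID
    # A script tag has no visible text, so read its raw content instead.
    payload = json.loads(script.get_attribute("innerHTML"))
    return collect_date_ids(payload)
```

With a real browser, read_date_ids(driver) would be called after driver.get(...) has loaded any ranking page.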
Team data can be retrieved from the HTML table rows with the class "row_rankingTableFullRow__Y_A4i". These rows can be located with a CSS selector, using find_elements to return every matching row (find_element returns only the first match).
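A minimal sketch of the row extraction might look like the following. It assumes that each team row holds its values in td cells, which is not confirmed here, and it returns the raw cell texts without naming the columns. Class names with a generated suffix like this one tend to change when the site is rebuilt, so the selector may need updating. The locator strategy is the plain string "css selector", the value behind By.CSS_SELECTOR.

```python
ROW_SELECTOR = ".row_rankingTableFullRow__Y_A4i"


def scrape_current_page(driver):
    """Return one list of cell texts per team row on the page currently shown."""
    rows = driver.find_elements("css selector", ROW_SELECTOR)
    records = []
    for row in rows:
        # Assumption: the values of a team row sit in <td> cells.
        cells = row.find_elements("css selector", "td")
        records.append([cell.text.strip() for cell in cells])
    return records
```

An empty list means no row matched, which is a useful signal that the page has not finished loading or that the class name has changed.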
Only the first 50 teams are displayed on the current page; users have to click the pagination control to view the next 50 teams. Navigation to the next page can therefore be automated by locating that control with the same find_element method and an XPath expression, then clicking it.
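The pagination step could be sketched as below. The XPath is a placeholder, not the real one from the FIFA page, and has to be replaced after inspecting the pagination control in the browser. The sketch uses find_elements rather than find_element so that a missing control yields an empty list instead of an exception; "xpath" is the value behind By.XPATH.

```python
# Hypothetical XPath: replace it with the one found by inspecting the page.
NEXT_BUTTON_XPATH = "//button[@aria-label='Next page']"


def go_to_next_page(driver, xpath=NEXT_BUTTON_XPATH):
    """Click the next-page control; return False when there is nothing to click."""
    buttons = driver.find_elements("xpath", xpath)
    if not buttons or not buttons[0].is_enabled():
        return False  # last page reached, or the control was not found
    buttons[0].click()
    return True
```

After a successful click the table needs a moment to refresh, so a real script would wait for the new rows before scraping again.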
Data cleaning
The scraped data is largely standardized and clean, except for the following minor issues:
- September is formatted as "DD Sept YYYY", unlike the other months, which use a three-letter abbreviation ("DD MMM YYYY", for example 22 Dec 2022).
- Several country names should be standardized. For example, "Côte d'Ivoire" and "Türkiye" can be replaced with "Ivory Coast" and "Turkey".
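Both fixes are small enough to sketch in a few lines. The helper names parse_period and standardize_country are made up for illustration, and the mapping contains only the two countries mentioned above; more entries would be added as needed. Looking up only the first three letters of the month handles "Sept" without relying on the system locale.

```python
from datetime import date

MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

COUNTRY_NAMES = {
    "Côte d'Ivoire": "Ivory Coast",
    "Türkiye": "Turkey",
}


def parse_period(label: str) -> date:
    """Parse 'DD Mon YYYY', also accepting 'Sept' for September."""
    day, month, year = label.split()
    # Keep the first three letters so 'Sept' is treated like 'Sep'.
    return date(int(year), MONTHS[month[:3].title()], int(day))


def standardize_country(name: str) -> str:
    """Replace known variants with the preferred English name."""
    cleaned = name.strip()
    return COUNTRY_NAMES.get(cleaned, cleaned)


print(parse_period("22 Dec 2022"))    # 2022-12-22
print(standardize_country("Türkiye"))  # Turkey
```

Names that are not in the mapping pass through unchanged, so the function is safe to apply to the whole country column.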
3. Data visualization
Scraped data will be imported into Power BI for analysis and visualization purposes.
Example of data relationship
Example of FIFA federation analysis
Example of country analysis
Example of the best movers of the year
GitHub link: https://github.com/phuphan13/FIFA-web-scraping