The American Geophysical Union (AGU) meeting is a geoscience conference held each year around Christmas in San Francisco. It represents a great opportunity for PhD students like me to show off their work and enjoy what the west coast has to offer.

AGU logo

However, with nearly 24,000 attendees, the AGU Fall Meeting is also the largest Earth and space science meeting in the world. As such, it represents an interesting data set for exploring the important trends in the geoscience academic world. This year, I decided to step back and look at the mix of more than 23,000 oral and poster presentations from a data science perspective.

In this post, I explain how I gathered the data from the meeting website.

Scraping the data for each contribution

I usually use mechanize as a browser simulator to download pages and BeautifulSoup to parse the HTML content. However, the scientific program turns out to be built mainly from JavaScript calls, and this strategy simply did not work in this case. mechanize issues an HTTP request and returns the received HTML as-is, whereas here most of the information is generated by JavaScript code embedded in the page, which mechanize does not execute.
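For reference, that usual strategy looks roughly like the sketch below, using the contribution page scraped later in this post. Since the content we want is generated by JavaScript, the static HTML that mechanize retrieves does not contain it, and a search for the abstract element comes back empty.

import mechanize
from bs4 import BeautifulSoup

# sketch of the mechanize + BeautifulSoup strategy that fails here
br = mechanize.Browser()
br.set_handle_robots(False)
response = br.open('https://agu.confex.com/agu/fm15/meetingapp.cgi/Paper/67077')
soup = BeautifulSoup(response.read(), 'html.parser')

# the abstract is injected by JavaScript after the page loads, so the
# static HTML contains no element with the class 'Additional'
print(soup.find_all(class_='Additional'))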

Some googling later, I discovered that selenium is a much better option for scraping complicated HTML pages. Instead of simulating a browser, selenium allows you to directly drive Chromium, Safari or whatever browser you use when surfing the web.

Using selenium, the vanilla code to scrape a web page looks like

import os

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# open the browser
wd = webdriver.Chrome(os.path.join(PATH_TO_DRIVER, 'chromedriver'))

# visit the page
wd.get(link)

# wait for it to generate a specific element
WebDriverWait(wd, timeout).until(
    EC.visibility_of_element_located((By...., ...)))

# scrape the element
data = wd.find_element_by....().text

# quit when you are done
wd.quit()

And that’s all you need.

Modern browsers usually provide a way to inspect the HTML code and identify the different elements contained within a page. For instance, in Chrome, visit the page, right-click on it and select Inspect to get the full structure of the web page. This way, it is easy to identify the elements of the content you are interested in.

The webdriver exposes a very rich API to gather the desired content either by tag, class or id. In our case, we are going to use the class name. For instance, the contribution abstract is always contained in an element whose class is named Additional. Therefore, gathering the abstract of this specific contribution translates into the following Python code

# open the browser and visit the contribution page
wd = webdriver.Chrome(os.path.join(PATH_TO_DRIVER, 'chromedriver'))
link = 'https://agu.confex.com/agu/fm15/meetingapp.cgi/Paper/67077'
wd.get(link)
# wait for the abstract element to appear, then grab its text
WebDriverWait(wd, 3).until(EC.visibility_of_element_located((By.CLASS_NAME, 'Additional')))
abstract = wd.find_element_by_class_name('Additional').text
wd.quit()

For each contribution, I finally decided to collect all the information available, namely the tag, title, date, time, place, abstract, reference, authors, session and section. The data are available here for both 2014 and 2015 as JSON files, each of them containing a chunk of 1000 contributions. The code to read the JSON files is also available here.
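A rough sketch of how such a scraping loop could look is given below. The class name itemTitle and the helper are hypothetical placeholders restricted to two fields for brevity; the actual scripts are available in the repository linked above.

import json
# (selenium imports and PATH_TO_DRIVER as in the first snippet above)

# sketch: scrape a batch of contribution pages and dump them into one JSON chunk
def scrape_contributions(links, chunk_path, timeout=3):
    wd = webdriver.Chrome(os.path.join(PATH_TO_DRIVER, 'chromedriver'))
    papers = []
    for link in links:
        wd.get(link)
        WebDriverWait(wd, timeout).until(
            EC.visibility_of_element_located((By.CLASS_NAME, 'Additional')))
        papers.append({
            'link': link,
            'title': wd.find_element_by_class_name('itemTitle').text,  # hypothetical class name
            'abstract': wd.find_element_by_class_name('Additional').text})
    wd.quit()
    with open(chunk_path, 'w') as f:
        json.dump(papers, f)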

For instance, taking a look at all the contributions of 2015 boils down to

from geocolab.Data_Utils import *
path_data = 'Path_geocolab/data/data_agu2015'
data = get_all_data(path_data)

where data is a list of all the contributions. For clarity, each contribution is defined as an instance of a class Paper whose attributes correspond to the elements that identify the contribution on the website.
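To give an idea of this container, a minimal sketch of such a class could look like the following; the actual implementation in geocolab may differ.

class Paper(object):
    """Minimal sketch of a contribution container; the real class lives in geocolab."""
    def __init__(self, tag, title, date, time, place, abstract,
                 reference, authors, session, section):
        self.tag = tag                # contribution tag
        self.title = title            # title of the contribution
        self.date = date              # day of the presentation
        self.time = time              # time slot
        self.place = place            # room or poster hall
        self.abstract = abstract      # full abstract text
        self.reference = reference    # contribution reference
        self.authors = authors        # dict mapping author name -> institution
        self.session = session        # session name
        self.section = section        # AGU section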

Getting the country of origin for each participant

One thing that is missing from each contribution page is the country of origin of each participant. Indeed, out of curiosity, I wanted to take a look at the geographic distribution of AGU contributors.

My first idea was to extract this information directly from the scientific institution each researcher is affiliated with. Indeed, these are given on each contribution page and, for instance, for one of my contributions, we have

from geocolab.Data_Utils import *
path_data = 'Path_geocolab/data/data_agu2015'
data = get_all_data(path_data)
mycontributions = [contrib for contrib in data if 'Clement Thorey' in contrib.authors.keys()]
print 'Title: %s'%(mycontributions[0].title[0])
print 'Author: %s'%(mycontributions[0].authors.keys()[0])
print 'Institution: %s'%(mycontributions[0].authors.values()[0])

which returns

- Title: Floor-Fractured Craters through Machine Learning Methods
- Author: Clement Thorey
- Institution: Institut de Physique du Globe de Paris

From here, I thought extracting the country from the institution would be easy. For instance, the geopy module provides a very rich API to do exactly that

from geopy.geocoders import Nominatim
geolocator = Nominatim()
a = geolocator.geocode("Institut de Physique du Globe de Paris")
print a.address

returns

- Institut de Physique du Globe de Paris, 1, Rue Jussieu, 5e Arrondissement, 5e, Paris, Île-de-France, France métropolitaine, 75005, France

Pretty accurate! Nevertheless, while this works for some institutions, it also fails for many others. I tried several different geocoder classes, in particular the Google Geocoding API (V3), but I got the same insufficient results. The principal cause of failure of the different geocoder classes appears to be the translation of the home institution name. For instance, most Chinese researchers provide their institution name in its English translation, which the geocoder API does not recognize.
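For completeness, chaining several geopy geocoders could look like the sketch below, where the country is guessed from the last component of the returned address; note that GoogleV3 may require an API key depending on your geopy version.

from geopy.geocoders import GoogleV3, Nominatim

# sketch: try several geocoders in turn and guess the country from the address
def institution_country(institution):
    for geocoder in (Nominatim(), GoogleV3()):  # GoogleV3 may need api_key=...
        try:
            location = geocoder.geocode(institution)
        except Exception:
            continue
        if location is not None:
            # the country is usually the last component of the formatted address
            return location.address.split(',')[-1].strip()
    return None

print(institution_country("Institut de Physique du Globe de Paris"))  # -> France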

Therefore, after several unsuccessful attempts, I decided to extract this information directly from the website. To do that, I ran a scraping procedure similar to the one described above on what looks like an index of all AGU members, contained in the Person section of the website. Indeed, each person contributing to the AGU appears to be referenced in this part of the website with their

  • name
  • address + country
  • contributions

These data are also available as JSON files, in chunks of 5000 names, here for both the 2014 and 2015 AGU meetings. The scripts used for scraping are located here.
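For illustration, scraping one of these person pages follows the same selenium pattern as before; the person URL and the element class name in the sketch below are hypothetical placeholders, the real ones being in the scripts linked above.

# sketch: scrape one person page with the same selenium pattern as before
# (the person id and the class name 'PersonDetails' are hypothetical placeholders)
person_link = 'https://agu.confex.com/agu/fm15/meetingapp.cgi/Person/12345'
wd = webdriver.Chrome(os.path.join(PATH_TO_DRIVER, 'chromedriver'))
wd.get(person_link)
WebDriverWait(wd, 3).until(
    EC.visibility_of_element_located((By.CLASS_NAME, 'PersonDetails')))
details = wd.find_element_by_class_name('PersonDetails').text  # name, address and country
wd.quit()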