Step-by-step manual to use R and Selenium to scrape Empleos Publicos | R-Bloggers


[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers.]



Due to delays with my stock market payment, if this post is useful for you, I kindly request a minimal donation: Buy me a coffee. It will be used to continue my open source efforts. The complete explanation is here: A personal message from an Open Source employee.

Motivation

My friend Nicolas Didier asked me about scraping Empleos Públicos (public job postings) with R or Python. Here is a short example for him and for everyone else who can benefit from it.

The following steps were adjusted from a tutorial that I had given in 2023 at the University of Michigan (Go Blue!).

Required R packages

  • RSelenium: R–Selenium integration
  • rvest: HTML processing
  • dplyr: to load the pipe operator (can be used later for data cleaning)
  • purrr: iteration (i.e., repeated operations)

I installed RSelenium from the R console:

if (!require(RSelenium)) install.packages("RSelenium")

# or

remotes::install_github("ropensci/RSelenium")

For the rest of the packages:

if (!require(rvest)) install.packages("rvest")
if (!require(dplyr)) install.packages("dplyr")
if (!require(purrr)) install.packages("purrr")

Installation of Selenium and Chrome/Chromium

Note for Ubuntu/Debian users: check that Chrome or Chromium is installed on your system. One of the many options is to use the Bash console:

sudo add-apt-repository ppa:savoury1/chromium
sudo apt update
sudo apt install chromium-browser
sudo apt install chromium-chromedriver

Not using the PPA installs the Snap version of Chromium, which is not compatible with Selenium.

I tried to start Selenium as described in the official guide and it didn’t work.

I had to install Chrome or Chromium. I am on Manjaro, so I ran sudo pacman -S chromium. Windows/Mac users can use Google Chrome.

An extra requirement was to download the Selenium server. Based on this, I created a folder to store the data for this post by typing in the VS Code terminal:

mkdir -p /tmp/didier-example
cd /tmp/didier-example

Then I opened R and downloaded the JAR file:

url_jar <- "https://github.com/SeleniumHQ/selenium/releases/download/selenium-3.9.1/selenium-server-standalone-3.9.1.jar"
sel_jar <- "selenium-server-standalone-3.9.1.jar"

if (!file.exists(sel_jar)) {
  download.file(url_jar, sel_jar)
}

I had to launch Selenium from a new terminal:

cd /tmp/didier-example
java -jar selenium-server-standalone-3.9.1.jar
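If you prefer to keep everything in one R session, the server can also be launched from R with base R's system2() in non-blocking mode. This is a sketch under the assumption that Java is on the PATH and the JAR was downloaded as above; the log file name is my own choice:

```r
# Launch the Selenium server from R without blocking the console.
sel_jar <- "/tmp/didier-example/selenium-server-standalone-3.9.1.jar"

system2(
  "java",
  args = c("-jar", shQuote(sel_jar)),
  wait = FALSE,            # return immediately; the server keeps running
  stdout = "selenium.log", # capture server output for debugging
  stderr = "selenium.log"
)

# Give the server a moment to bind to port 4444 before connecting
Sys.sleep(3)
```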

Back in the R terminal, I was finally able to run:

library(RSelenium)
library(rvest)
library(dplyr)
library(purrr)

rmDr <- remoteDriver(port = 4444L, browserName = "chrome")

rmDr$open(silent = TRUE)

url <- "https://www.empleospublicos.cl"

rmDr$navigate(url)
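As an alternative to downloading the JAR manually, RSelenium's rsDriver() can manage the server for you (it wraps the wdman package, which fetches the binaries). A sketch, assuming a chromedriver compatible with your browser can be downloaded:

```r
library(RSelenium)

# rsDriver() downloads and starts a Selenium server, then opens a
# browser client, all in one call.
driver <- rsDriver(browser = "chrome", port = 4445L, verbose = FALSE)
rmDr <- driver$client

rmDr$navigate("https://www.empleospublicos.cl")

# When finished:
# rmDr$close()
# driver$server$stop()
```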

This should display a new Chrome/Chromium window that says: “Chrome is being controlled by automated test software”.

Scrape the data

With the help of the browser inspector (Ctrl + Shift + I) I examined the page to find the element corresponding to the search bar.

For example, I can search for “Ministerio de Salud”, because there were many postings from that organization on the landing page:

search_box <- rmDr$findElement(using = "id", value = "buscadorprincipal")
search_box$sendKeysToElement(list("Ministerio de Salud", key = "enter"))
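Right after submitting a search, the results may not have finished rendering. A small polling helper (my own sketch, not part of RSelenium's API) avoids brittle fixed-length Sys.sleep() calls:

```r
# Retry an element lookup until it succeeds or the timeout expires.
# findElement() errors while the page is still loading, so we poll.
wait_for_element <- function(driver, using, value,
                             timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    el <- tryCatch(
      driver$findElement(using = using, value = value),
      error = function(e) NULL
    )
    if (!is.null(el)) return(el)
    if (Sys.time() > deadline) stop("Timed out waiting for: ", value)
    Sys.sleep(interval)
  }
}

# Example (assumes the rmDr session from above):
# results <- wait_for_element(rmDr, "css selector", "div.items")
```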

That typed “Ministerio de Salud” and submitted the search on my behalf. Inspecting the results, I saw that every job offer sits inside a div with class “items”.

The first offer listed is this:

Ministerio de Salud
Constitución
No pide experiencia (“no experience required”)

html <- read_html(rmDr$getPageSource()[[1]])

offers <- html %>% html_nodes("div.items")

offers_tbl <- map_df(offers, function(offer) {
  # Extract position (job title)
  position <- offer %>%
    html_node("h3 a") %>%
    html_text(trim = TRUE)

  # Extract organization (usually the first <p> inside .top)
  organization <- offer %>%
    html_node(".top p") %>%
    html_text(trim = TRUE)

  # Extract city (the second <p> inside .cnt)
  city <- offer %>%
    html_nodes(".cnt p") %>%
    .[2] %>%
    html_text(trim = TRUE)

  tibble(
    position = position,
    organization = organization,
    city = city
  )
})

The result has the following structure:

offers_tbl
# A tibble: 552 × 3
   position                                                   organization city 
   <chr>                                                      <chr>        <chr>
 1 Medico (a) especialista en Anestesiología 44 horas         Servicio de… Cons…
 2 Titulares de la Planta Profesional Ley 18.834              Servicio de… Valp…
 3 ENFERMERA-O, JORNADA DIURNA, GRADO 12, PARA SERVICIO CLÍN… Servicio de… Reco…
 4 Psiquiatra infanto-juvenil sistema de atención intersecto… Servicio de… La P…
 5 Neurólogo(a) adulto GES Alzheimer y otras demencias        Servicio de… Puen…
 6 Médico(a) especialista en Neurología Infantil Hospital de… Servicio de… Cast…
 7 Arquitecto de Software                                     Central de … Ñuñoa
 8 TENS OPERADOR DE EQUIPOS DE ESTERILIZACIÓN                 Servicio de… Peña…
 9 (850-2892) Médico Especialista Broncopulmonar o Internist… Servicio de… Talc…
10 Enfermero(a) Clínico(a) Atención Abierta y Cerrada         Servicio de… Huas…
glimpse(offers_tbl)
Rows: 552
Columns: 3
$ position      <chr> "Medico (a) especialista en Anestesiología 44 horas", "Ti…
$ organization  <chr> "Servicio de Salud Maule / Hospital de Constitución", "Se…
$ city          <chr> "Constitución", "Valparaíso", "Recoleta", "La Pintana", "…
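As the package list hinted, dplyr can take over from here for data cleaning. For instance, counting offers per organization; shown below with a tiny stand-in tibble so the snippet runs on its own, since the real offers_tbl comes from the scrape above:

```r
library(dplyr)

# Stand-in for offers_tbl; the real one is built by the scraping code above
offers_tbl <- tibble(
  position     = c("Enfermera/o", "Medico/a", "Arquitecto de Software"),
  organization = c("Servicio de Salud Maule", "Servicio de Salud Maule",
                   "Central de Abastecimiento"),
  city         = c("Constitucion", "Talca", "Nunoa")
)

# Which organizations post the most offers?
offers_tbl %>%
  count(organization, sort = TRUE)
```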

I know this is a simple example, but it should enable different kinds of exploration and data extraction. I hope it helps.


