May 18, 2024
On a daily or weekly basis, I look for Ph.D. studentships on various websites, chief among them the good old cfd-online, which mostly posts CFD-related jobs at this URL. The issue with this website is the absence of an email notification feature or an RSS feed, either of which would let me check for new job postings from the comfort of the command line. This post is my attempt to automate the process of checking for new jobs.
I want to write a script to: 1. List all the recent (let's say 10) jobs, and 2. Check whether any new jobs have been posted.
For the following commands and scripts to work, we need two programs that may not be pre-installed on all Unix operating systems, namely curl and perl.
# Debian-based Linux Distros
sudo apt install curl perl
# macOS
brew install curl perl
For those who just want to use the command without delving into the forthcoming details, here’s the one-liner:
curl --silent --location 'https://www.cfd-online.com/Jobs/listjobs.php?category=PhD%20Studentship' | perl -0777 -nle 'my $i = 1; print "$i. $1 ($2; $3)\n" and $i++ while /<a href="showjob\.php\?record_id=[0-9]{5}">(.*?)<\/a>.*?\s*<I>(.*?)<\/I><br>\s*(.*?)<br>/gs' | head -10
I am looking for the following information about each job: its title, the institution, and the location.
To extract these data, we first download the page:
curl --silent --location 'https://www.cfd-online.com/Jobs/listjobs.php?category=PhD%20Studentship' --output jobs.html # or curl -sL
-s/--silent: suppresses the progress information.
-L/--location: tells curl to follow redirects (if the requested URL has moved to a new location, curl will follow the HTTP Location headers to fetch the new location).
We can examine the structure of the page. The following section appears somewhere in the middle of the page:
...
<h2>CFD Jobs Database - List of Jobs</h2><p>
<table border=2 cellpadding=5 width=100%>
<H3>Category: PhD Studentship</H3> <tr>
<td>
<b><a href="showjob.php?record_id=19178">PhD in Thermal Management for Future Aerospace Propulsion </a></b><br>
<I>Heriot Watt University</I><br>
United Kingdom, Edinburgh<br>
<font size=-2 face=helvetica>Record Last Modified 12:43:29 May 17 2024, Closure Date Jul 22 2024</font>
</td>
<td align=CENTER>
<form action="showjob.php" method=GET><input type="hidden" name="record_id" value="19178"><input type="submit" value=" View Full Record ">
</form>
<font size=-2 face=helvetica>Read 166 times</font>
</td>
</tr>
<tr>
<td>
<b><a href="showjob.php?record_id=19173">Modeling and numerical simulation of parietal heat transfer</a></b><br>
<I>PROMES-CNRS</I><br>
France, Perpignan<br>
<font size=-2 face=helvetica>Record Last Modified 22:42:21 May 16 2024, Closure Date Not Specified</font>
</td>
<td align=CENTER>
<form action="showjob.php" method=GET><input type="hidden" name="record_id" value="19173"><input type="submit" value=" View Full Record ">
</form>
<font size=-2 face=helvetica>Read 184 times</font>
</td>
</tr>
...
As you can see, jobs are listed in a <table>, with each title surrounded by a b tag, the name of the institute inside an I tag, and the location following shortly after the institution name, terminated by a <br>.
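A quick way to confirm this structure from the command line is to pull out just the record links and the institution tags. The sketch below runs on a small sample saved as sample.html (a hypothetical file name, holding one record copied from the excerpt above) rather than on the live page; with the real download, jobs.html would take its place. It relies on grep -o, which both GNU and BSD grep provide.

```shell
# Sample record copied from the page excerpt above (sample.html is a made-up name).
cat > sample.html <<'EOF'
<tr>
<td>
<b><a href="showjob.php?record_id=19173">Modeling and numerical simulation of parietal heat transfer</a></b><br>
<I>PROMES-CNRS</I><br>
France, Perpignan<br>
</td>
</tr>
EOF

# Print only the record links: one line per job.
links=$(grep -o 'showjob\.php?record_id=[0-9]*' sample.html)
echo "$links"

# Print only the institution tags.
institutes=$(grep -o '<I>[^<]*</I>' sample.html)
echo "$institutes"
```

Each job contributes exactly one link and one I tag, which is what makes these two tags convenient anchors for a pattern.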
Before we go over how to extract the necessary information, I would like to point you to an extremely useful website, regex101.com, which allows you to experiment with (various dialects of) regex.
The following regex matches the information I need:
<a href="showjob\.php\?record_id=[0-9]{5}">(.*?)<\/a>.*?\s*<I>(.*?)<\/I><br>\s*(.*?)<br>
Let's break it down step by step:
<a href="showjob\.php\?record_id=[0-9]{5}">:
    <a href="showjob\.php\?record_id=: Matches the literal string <a href="showjob.php?record_id=. The \? escapes the question mark (?) and the \. escapes the dot (.), making sure they are treated as literal characters.
    [0-9]{5}: Matches exactly five digits, representing the job record ID.
    ">: Matches the closing quote and greater-than sign (">) following the record ID.
(.*?):
    (.*?): This is the first capture group. The .*? is a non-greedy match that captures any character as few times as possible (by default the dot excludes newline characters; the s modifier used later lifts that restriction). This capture group extracts the job title.
<\/a>:
    <\/a>: Matches the closing anchor tag (</a>). The backslash (\) escapes the forward slash (/), ensuring it is treated as a literal character rather than as the end of the /.../ pattern.
.*?\s*:
    .*?: Matches any character as few times as possible (non-greedy match). This ensures we match everything between the job title and the institution name.
    \s*: Matches any whitespace characters (including spaces, tabs, and newlines) zero or more times.
<I>(.*?)<\/I>:
    <I>: Matches the opening italic tag (<I>).
    (.*?): This is the second capture group, capturing the institution name non-greedily.
    <\/I>: Matches the closing italic tag (</I>).
<br>\s*(.*?)<br>:
    <br>: Matches the HTML line break tag (<br>).
    \s*: Matches any whitespace characters zero or more times, allowing for any extra spaces or newlines between the HTML tags.
    (.*?): This is the third capture group, capturing the location non-greedily.
    <br>: Matches the line break tag (<br>) that ends the location.
As tested on regex101.com, this is a PCRE2 regex. Perl allows us to extract the data using the same pattern:
curl --silent --location 'https://www.cfd-online.com/Jobs/listjobs.php?category=PhD%20Studentship' | perl -0777 -nle 'my $i = 1; print "$i. $1 ($2; $3)\n" and $i++ while /<a href="showjob\.php\?record_id=[0-9]{5}">(.*?)<\/a>.*?\s*<I>(.*?)<\/I><br>\s*(.*?)<br>/gs'
Let's also dissect it step by step:
perl -0777 -nle '...':
    -0777: Enables slurp mode, making Perl read the entire input as a single string. This is crucial for matching patterns that span multiple lines.
    -n: Wraps the code in an implicit while (<>) { ... } loop, iterating over the input file or standard input.
    -l: Handles line endings by automatically chomping the input record separator and appending the output record separator to each print statement. (Because -0777 comes first and leaves the input separator undefined, -l has nothing to append here, which is why the output string carries its own "\n".)
    -e: Allows execution of the provided Perl code directly from the command line.
'my $i = 1; print "$i. $1 ($2; $3)\n" and $i++ while /PATTERN/gs':
    my $i = 1;: Initializes a counter variable $i to 1. This variable keeps track of the line number for each match.
    print "$i. $1 ($2; $3)\n":
        "$i. $1 ($2; $3)\n": Constructs the output string. It prints the current value of the counter followed by the captured groups.
            $i: The current line number.
            $1: The first capture group, which contains the job title.
            $2: The second capture group, which contains the institution name.
            $3: The third capture group, which contains the location.
            "\n": Adds a newline character at the end of the output string.
    and $i++: Increments the counter variable $i after printing. The and operator ensures that $i++ is executed only if the print statement is successful.
    while /PATTERN/gs:
        while ...: A loop that continues as long as the regex pattern matches.
        /PATTERN/gs: The regex pattern applied globally (g), so that each iteration resumes after the previous match, and with the s modifier, which treats the input as a single string (i.e., the dot matches a newline as well).
Finally, we can pipe it to head -10 to list the 10 most recent jobs:
curl --silent --location 'https://www.cfd-online.com/Jobs/listjobs.php?category=PhD%20Studentship' | perl -0777 -nle 'my $i = 1; print "$i. $1 ($2; $3)\n" and $i++ while /<a href="showjob\.php\?record_id=[0-9]{5}">(.*?)<\/a>.*?\s*<I>(.*?)<\/I><br>\s*(.*?)<br>/gs' | head -10
Output of the command (date: May 18, 2024):
1. PhD in Thermal Management for Future Aerospace Propulsion (Heriot Watt University; United Kingdom, Edinburgh)
2. Modeling and numerical simulation of parietal heat transfer (PROMES-CNRS; France, Perpignan)
3. PhD studentship - Modelling bubble-particle interactions (University of Birmingham; United Kingdom, Birmingham)
4. Numerical investigation on casting in ESF (Faculty of Eng and Info Sciences, University of Wollongong; Australia, NSW, Wollongong)
5. PhD within the Centre of Computational Engineering Sciences (Cranfield University; United Kingdom, Bedfordshire, Bedford)
6. Stochastic particle methods for two-phase flows (University of Stuttgart; Germany, Baden-Württemberg, Stuttgart)
7. PhD Candidate in fire safety and fire development in buildings (Norwegian University of Science and Technology; Norway, Trondheim)
8. Graduate Research assistant for CFD simulations of RDE (University of Texas San Antonio; United States, TX, SAN ANTONIO)
9. Atmospheric Thermal Transport (University of South Florida; United States, FL, Tampa)
10. Simulations of offshore wind farm turbulence (University of Twente; Netherlands, Overijssel, Enschede)
Now, at this stage, we have the option to encapsulate this process within a function or a shell script:
getcfdjobs() {
    # number of recent jobs (by default: 10 jobs)
    cnt=${1:-10}
    curl --silent --location 'https://www.cfd-online.com/Jobs/listjobs.php?category=PhD%20Studentship' \
        | perl -0777 -nle \
            'my $i = 1; print "$i. $1 ($2; $3)\n" and $i++ while /<a href="showjob\.php\?record_id=[0-9]{5}">(.*?)<\/a>.*?\s*<I>(.*?)<\/I><br>\s*(.*?)<br>/gs' \
        | head -n "$cnt"
}
With this, one can retrieve, for example, the 5 most recent jobs:
$ getcfdjobs 5
1. PhD in Thermal Management for Future Aerospace Propulsion (Heriot Watt University; United Kingdom, Edinburgh)
2. Modeling and numerical simulation of parietal heat transfer (PROMES-CNRS; France, Perpignan)
3. PhD studentship - Modelling bubble-particle interactions (University of Birmingham; United Kingdom, Birmingham)
4. Numerical investigation on casting in ESF (Faculty of Eng and Info Sciences, University of Wollongong; Australia, NSW, Wollongong)
5. PhD within the Centre of Computational Engineering Sciences (Cranfield University; United Kingdom, Bedfordshire, Bedford)
The second step involves checking if the website has been updated. There are several ways to accomplish this:
The first option isn't suitable here, as the number of visits to jobs is updated constantly regardless of whether jobs themselves have been updated. Between the second and third options, the latter seems more efficient, as it only requires reading the page once. The second option involves first reading the last update time and then reading the jobs.
To implement this, we need to store the jobs in a sort of database file:
getcfdjobs 100 > $HOME/.cfdjobs
The following command informs us whether the page has been updated with new jobs:
diff <(getcfdjobs 1) <(head -1 "$HOME/.cfdjobs") && echo "No new job" || echo "New job detected"
However, the <() syntax is not available in simpler and more restricted POSIX-compliant shells like dash (commonly used in many Linux distributions where /bin/sh is a symlink to /usr/bin/dash). In such shells, we can achieve the same result by doing the following:
# Make temporary files and remove them when the shell exits
cfd_jobs_old=$(mktemp)
cfd_jobs_new=$(mktemp)
trap 'rm -f "$cfd_jobs_old" "$cfd_jobs_new"' EXIT INT TERM
# First get the latest stored job
head -n 1 "$HOME/.cfdjobs" > "$cfd_jobs_old"
# Update the list of jobs and get the newest job
getcfdjobs 100 | tee "$HOME/.cfdjobs" | head -n 1 > "$cfd_jobs_new"
# Perform diff on the temporary files
if ! diff "$cfd_jobs_old" "$cfd_jobs_new"; then
    echo "New job detected"
else
    echo "No new job"
fi
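Since only the first line is compared, a POSIX shell can also do without temporary files by holding both lines in variables. The following is a minimal sketch of that variation, not the approach this post uses: has_new_job is a hypothetical helper name, and getcfdjobs is replaced by a stand-in that prints fixed lines so the example runs offline.

```shell
# Stand-in for the real getcfdjobs so the sketch runs offline (assumption:
# the real function prints one numbered job per line, newest first).
getcfdjobs() {
    printf '%s\n' '1. Job B (Uni B; Country B)' '2. Job A (Uni A; Country A)' \
        | head -n "${1:-10}"
}

# has_new_job DBFILE: succeed if the newest job differs from the stored one.
has_new_job() {
    old=$(head -n 1 "$1")
    new=$(getcfdjobs 1)
    [ "$old" != "$new" ]
}

db=$(mktemp)   # stands in for $HOME/.cfdjobs in this demonstration
echo '1. Job A (Uni A; Country A)' > "$db"

if has_new_job "$db"; then
    result="New job detected"
else
    result="No new job"
fi
echo "$result"
rm -f "$db"
```

Unlike the diff version, this does not print which line changed and does not refresh the stored list, so it is only a sketch of the comparison step.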
With all of these components in place, we can now write a script to check and update the jobs and notify when it happens.
The script is named checkcfdjobs.[1]
#!/bin/sh -e
# file: checkcfdjobs
# description: Search for jobs on cfd-online.com and notify if new jobs added
# usage: checkcfdjobs

# Send a desktop notification using the tool available on the current OS
notify() {
    if [ "$(uname -s)" = "Linux" ]; then
        notify-send "[CFD-Online] New job!"
    elif [ "$(uname -s)" = "Darwin" ]; then
        osascript -e 'display notification "New job!" with title "CFD-Online" sound name "Blow"'
    else
        printf "OS not supported!\n" >&2
        exit 1
    fi
}

# Print the N most recent jobs (by default: 10 jobs)
getcfdjobs() {
    cnt=${1:-10}
    curl --silent --location 'https://www.cfd-online.com/Jobs/listjobs.php?category=PhD%20Studentship' \
        | perl -0777 -nle \
            'my $i = 1; print "$i. $1 ($2; $3)\n" and $i++ while /<a href="showjob\.php\?record_id=[0-9]{5}">(.*?)<\/a>.*?\s*<I>(.*?)<\/I><br>\s*(.*?)<br>/gs' \
        | head -n "$cnt"
}

# Temporary files holding the previously stored and the current newest job
cfd_jobs_old=$(mktemp)
cfd_jobs_new=$(mktemp)
trap 'rm -f "$cfd_jobs_old" "$cfd_jobs_new"' EXIT INT TERM

head -n 1 "$HOME/.cfdjobs" > "$cfd_jobs_old"
getcfdjobs 100 | tee "$HOME/.cfdjobs" | head -n 1 > "$cfd_jobs_new"

if ! diff "$cfd_jobs_old" "$cfd_jobs_new"; then
    notify
else
    echo "No new job"
fi
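One case the script does not cover is the very first run: if $HOME/.cfdjobs was not created beforehand, head fails and, because of -e, the script stops there. A minimal sketch of a guard that could be called before the head command is shown below. init_db is a hypothetical helper, and the demonstration uses a temporary directory in place of $HOME.

```shell
# init_db FILE: create an empty database file if it is missing,
# so that a later `head -n 1 FILE` does not fail on the first run.
init_db() {
    [ -f "$1" ] || : > "$1"
}

demo_home=$(mktemp -d)          # stands in for $HOME in this demonstration
init_db "$demo_home/.cfdjobs"
first_line=$(head -n 1 "$demo_home/.cfdjobs")
echo "first line: '$first_line'"
```

With an empty database, the first comparison sees a difference and reports a new job, which seems a reasonable outcome for a first run.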
To run the check at shell startup, add this line to your shell startup configuration file (.zshrc or .bashrc):
checkcfdjobs
The script will now execute every time a new shell is spawned.
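Running the check on every new shell means one network request per terminal. One possible refinement, sketched here under assumptions, is to skip the check when the job list was refreshed recently. is_stale is a hypothetical helper, and it relies on the -mmin test of find, which GNU and BSD find provide but which may be missing from other implementations.

```shell
# is_stale FILE MINUTES: succeed if FILE is missing or was last modified
# more than MINUTES minutes ago (uses find -mmin; see the note above).
is_stale() {
    [ ! -e "$1" ] || [ -n "$(find "$1" -mmin +"$2")" ]
}

# Demonstration on a temporary file standing in for $HOME/.cfdjobs.
db=$(mktemp)
is_stale "$db" 180 && fresh_result="stale" || fresh_result="fresh"

touch -t 202001010000 "$db"    # pretend the last check was in January 2020
is_stale "$db" 180 && old_result="stale" || old_result="fresh"

echo "$fresh_result $old_result"
rm -f "$db"
```

In the startup file, the line would then become something like is_stale "$HOME/.cfdjobs" 180 && checkcfdjobs, since checkcfdjobs rewrites $HOME/.cfdjobs on every run and thereby updates its modification time.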
Alternatively, to run the check periodically, add the following job to your crontab (crontab -e):
# Every 3 hours
0 */3 * * * /absolute/path/to/checkcfdjobs
Additionally, there are ways to send email from the command line, but I will not discuss them here.
[1] Put this script somewhere in your PATH (usually in $HOME/bin or $HOME/.local/bin).