EVOLUTION-MANAGER
Edit File: withCookies.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> <title>Using Cookies for Connected Requests with RCurl</title><link rel="stylesheet" type="text/css" href="OmegaTech.css"></link><meta name="generator" content="DocBook XSL Stylesheets V1.76.1"></meta></head><body class="yui-skin-sam"><div class="article" title="Using Cookies for Connected Requests with RCurl"><div class="titlepage"><div><div><h2 class="title"><a id="id1170766181600"></a>Using Cookies for Connected Requests with <a xmlns="" href="http://www.omegahat.net/RCurl">RCurl</a></h2></div><div><div class="author"><h3 class="author"><span class="firstname">Duncan</span> <span class="surname">Temple Lang</span></h3><div class="affiliation"><span class="orgname">University of California at Davis<br></br></span> <span class="orgdiv">Department of Statistics<br></br></span></div></div></div></div><hr></hr></div><div class="section" title="The Problem"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="id1170765764697"></a>The Problem</h2></div></div></div><p> This is an example of using <a xmlns="" href="http://www.omegahat.net/RCurl">RCurl</a> with cookies. This comes from a question on the R-help mailing list on Sep 18th, 2012. </p><p> In a Web browser, we visit the page <code>http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18</code> (no longer available). <!-- <a class="ulink" href="http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18" target="_top">http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18</a>. --> Before allowing us access to the data, the Web site presents us with a disclaimer page. We have to click on the <span class="guibutton">I Agree</span> button and then we are forwarded to a page with the actual data. We want to read that using, for example, <i xmlns:r="http://www.r-project.org" xmlns:s="http://cm.bell-labs.com/stat/S4" xmlns:py="http://www.python.org" xmlns:perl="http://www.perl.org" xmlns:c="http://www.C.org" xmlns:vb="http://www.visualbasic.com" xmlns:omegahat="http://www.omegahat.net" xmlns:bioc="http://www.bioconductor.org" xmlns:java="http://www.java.com" xmlns:sql="http://www.sql.org" xmlns="" class="rfunc">readHTMLTable() </i> in the <a xmlns="" href="http://www.omegahat.net/RSXML">XML</a> package. </p><p> What happens when we click on the <span class="guibutton">I Agree</span> button? That sets a cookie. After that, we include that cookie in each request to that server and this confirms that we have agreed to the disclaimer. The Web server will process each request containing the cookie knowing we have agreed and so give us the data. So we need to first make a request in <b xmlns:xd="http://www.xsldoc.org" xmlns="" class="proglang">R</b> that emulates clicking the <span class="guibutton">I Agree</span> button. We have to arrange for that request to recognize the cookie in the response and then use that cookie in all subsequent requests to that server. We could do this manually, but there is no need to. We simply use the same curl object in all of the requests. In the first request, libcurl will process the response and retrieve the cookie. By using the same curl handle in subsequent requests, libcurl will automatically send the cookie in those requests. </p><p> We create the curl handle object with </p><div xmlns="" class="codeToggle"><div class="unhidden" id="id1170766234504"><div><pre class="" title="R code"> library(RCurl) curl = getCurlHandle(cookiefile = "", verbose = TRUE) </pre></div></div></div> <div xmlns="" class="clearFloat"></div> <p> This enables cookies in the handle, but does not arrange to write them to a file. We could store the cookie in a file when the curl handle is deleted. We could then use this in subsequent <b xmlns:xd="http://www.xsldoc.org" xmlns="" class="proglang">R</b> sessions or other curl handles. However, there is no need to do this. We can just agree to the disclaimer each time. However, if we do want to store the cookie in a file (when the curl handle is deleted), we can do this by specifying a file name as the value for the cookiefile argument. </p><p> The disclaimer page is a <code xmlns:http="http://www.w3.org/Protocols" xmlns="" class="httpOp">POST</code> form. We send the request to <code>http://www.wateroffice.ec.gc.ca/include/disclaimer.php</code> <!-- <a class="ulink" href="http://www.wateroffice.ec.gc.ca/include/disclaimer.php" target="_top">http://www.wateroffice.ec.gc.ca/include/disclaimer.php</a> --> with the parameter named disclaimer_action and the value "I Agree". We can get this information by reading the <b xmlns:xd="http://www.xsldoc.org" xmlns="" class="progLang">HTML</b> page and looking for the <code xmlns="" class="xmlTag"><form></code> element. Alternatively, we could use the <a xmlns="" href="http://www.omegahat.net/RHTMLForms">RHTMLForms</a> package. </p><p> We can make the request with </p><div xmlns="" class="codeToggle"><div class="unhidden" id="id1170766234537"><div><pre class="" title="R code"> postForm("http://www.wateroffice.ec.gc.ca/include/disclaimer.php", disclaimer_action = "I Agree", curl = curl) </pre></div></div></div> <div xmlns="" class="clearFloat"></div> <p> We can ignore the result as we just want the side-effect of getting the cookie in the curl handle. </p><p> We can now access the actual data at the original URL. We cannot use <i xmlns:r="http://www.r-project.org" xmlns:s="http://cm.bell-labs.com/stat/S4" xmlns:py="http://www.python.org" xmlns:perl="http://www.perl.org" xmlns:c="http://www.C.org" xmlns:vb="http://www.visualbasic.com" xmlns:omegahat="http://www.omegahat.net" xmlns:bioc="http://www.bioconductor.org" xmlns:java="http://www.java.com" xmlns:sql="http://www.sql.org" xmlns="" class="rfunc">readHTMLTable() </i> directly as that does not use a curl handle, and does not know about the cookie. Instead, we use <i xmlns:r="http://www.r-project.org" xmlns:s="http://cm.bell-labs.com/stat/S4" xmlns:py="http://www.python.org" xmlns:perl="http://www.perl.org" xmlns:c="http://www.C.org" xmlns:vb="http://www.visualbasic.com" xmlns:omegahat="http://www.omegahat.net" xmlns:bioc="http://www.bioconductor.org" xmlns:java="http://www.java.com" xmlns:sql="http://www.sql.org" xmlns="" class="rfunc">getURLContent() </i> to get the content of the page. We can then pass this text to <i xmlns:r="http://www.r-project.org" xmlns:s="http://cm.bell-labs.com/stat/S4" xmlns:py="http://www.python.org" xmlns:perl="http://www.perl.org" xmlns:c="http://www.C.org" xmlns:vb="http://www.visualbasic.com" xmlns:omegahat="http://www.omegahat.net" xmlns:bioc="http://www.bioconductor.org" xmlns:java="http://www.java.com" xmlns:sql="http://www.sql.org" xmlns="" class="rfunc">readHTMLTable() </i>. So we make the request with </p><div xmlns="" class="codeToggle"><div class="unhidden" id="id1170766234557"><div><pre class="" title="R code"> u = "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18" txt = getURLContent(u, curl = curl, verbose = TRUE) </pre></div></div></div> <div xmlns="" class="clearFloat"></div> <p> Personally, I prefer to use </p><div xmlns="" class="codeToggle"><div class="unhidden" id="id1170766234564"><div><pre class="" title="R code"> txt = getForm("http://www.wateroffice.ec.gc.ca/graph/graph_e.html", mode = "text", stn = "05ND012", prm1 = 3, syr = "2012", smo = "09", sday = "15", eyr = "2012", emo = "09", eday = "18", curl = curl) </pre></div></div></div> <div xmlns="" class="clearFloat"></div> <p> This makes it easier to change individual inputs. </p><p> The result should contain the actual data. </p><div xmlns="" class="codeToggle"><div class="unhidden" id="id1170766234573"><div><pre class="" title="R code"> library(XML) tbl = readHTMLTable(txt, asText = TRUE) </pre></div></div></div> <div xmlns="" class="clearFloat"></div> <p> We can find the number of rows and columns in each table with </p><div xmlns="" class="codeToggle"><div class="unhidden" id="id1170766234577"><div><pre class="" title="R code"> sapply(tbl, dim) dataTable hydroTable [1,] 852 1 [2,] 2 4 </pre></div></div></div> <div xmlns="" class="clearFloat"></div> <p> We want the first one. The columns are, by default, strings or factors. The numbers have a * on them. We can post-process this to get the values. </p><div xmlns="" class="codeToggle"><div class="unhidden" id="id1170766234583"><div><pre class="" title="R code"> tbl = readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE) tbl[[2]] = as.numeric(gsub("\\*", "", tbl[[2]])) tbl[[1]] = strptime(tbl[[1]], "%Y-%m-%d %H:%M:%S") </pre></div></div></div> <div xmlns="" class="clearFloat"></div> <p> </p></div><div class="section" title="Using RHTMLForms to Find the Disclaimer Form"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="id1170766234590"></a>Using <a xmlns="" href="http://www.omegahat.net/RHTMLForms">RHTMLForms</a> to Find the Disclaimer Form</h2></div></div></div><p> The <a xmlns="" href="http://www.omegahat.net/RHTMLForms">RHTMLForms</a> package can both read an <b xmlns:xd="http://www.xsldoc.org" xmlns="" class="progLang">HTML</b> page and get a description of all of its forms, and also generate an <b xmlns:xd="http://www.xsldoc.org" xmlns="" class="proglang">R</b> function corresponding to each form so that we can invoke the form as if it were a local function in <b xmlns:xd="http://www.xsldoc.org" xmlns="" class="proglang">R</b>. We get the descriptions with </p><div xmlns="" class="codeToggle"><div class="unhidden" id="id1170766234614"><div><pre class="" title="R code"> library(RHTMLForms) forms = getHTMLFormDescription(u, FALSE) </pre></div></div></div> <div xmlns="" class="clearFloat"></div> <p> We need to keep the buttons in the forms and hence the <i xmlns=""><code>FALSE</code></i> as the second argument to <i xmlns:r="http://www.r-project.org" xmlns:s="http://cm.bell-labs.com/stat/S4" xmlns:py="http://www.python.org" xmlns:perl="http://www.perl.org" xmlns:c="http://www.C.org" xmlns:vb="http://www.visualbasic.com" xmlns:omegahat="http://www.omegahat.net" xmlns:bioc="http://www.bioconductor.org" xmlns:java="http://www.java.com" xmlns:sql="http://www.sql.org" xmlns="" class="rfunc">getHTMLFormDescription() </i>. </p><p> We create the function for this form with </p><div xmlns="" class="codeToggle"><div class="unhidden" id="id1170766234627"><div><pre class="" title="R code"> fun = createFunction(forms[[1]]) </pre></div></div></div> <div xmlns="" class="clearFloat"></div> <p> </p><p> We can invoke this function using the curl handle we created to capture the cookies: </p><div xmlns="" class="codeToggle"><div class="unhidden" id="id1170766234635"><div><pre class="" title="R code"> fun(.curl = curl) </pre></div></div></div> <div xmlns="" class="clearFloat"></div> <p> This will agree to the disclaimer on our behalf. </p></div></div></body></html>