Here is a story for today. We have some public files stored on Google Drive that we would like to download automatically. The files I was interested in are relatively big multi-frame TIFF files (231 MB each).
I assume that you have the file IDs. If you do not know what a file ID is, please refer to one of the numerous online pages on the topic, for example on Lifehacker.
Given an ID, one can easily construct a download link. The problem comes with big files. If you try to download a big file, Google redirects you to a special page informing you that the file cannot be scanned for viruses. That page contains a download link. It wouldn't be Google if you could just parse the response, get the new link and use it directly. No! Everything is very dynamic: the new link contains a confirmation code, and in order to work it also needs a proper cookie. If you are on a Linux machine, here is a script for you from StackOverflow. We, however, are on whatever OS, using Matlab.
Alright, Matlab has several ways to get web content. webread is a very convenient top-level function, but you won't be able to use it here because you need to preserve cookies between requests. Matlab provides an example function that sends arbitrary HTTP requests with cookie support. Here is the corrected version of that function for reference:
    function [response, retInfos, history] = sendRequest(uri, request)
    % uri: matlab.net.URI
    % request: matlab.net.http.RequestMessage
    % response: matlab.net.http.ResponseMessage
    % retInfos: containers.Map with per-host cookies and redirect targets
    % history: vector of matlab.net.http.LogRecord for this transaction

    % matlab.net.http.HTTPOptions persists across requests to reuse previous
    % Credentials in it for subsequent authentications
    persistent options

    % infos is a containers.Map object where:
    %    key is uri.Host;
    %    value is "info" struct containing:
    %        cookies: vector of matlab.net.http.Cookie or empty
    %        uri: target matlab.net.URI if redirect, or empty
    persistent infos

    if isempty(options)
        options = matlab.net.http.HTTPOptions('ConnectTimeout',20);
    end

    if isempty(infos)
        infos = containers.Map;
    end

    % containers.Map is a handle object, so entries added further down are
    % visible through retInfos. Assigning it here also keeps the output
    % defined when we return early on a non-OK status.
    retInfos = infos;

    host = string(uri.Host); % get Host from URI
    try
        % get info struct for host in map
        info = infos(char(host));
        if ~isempty(info.uri)
            % If it has a uri field, it means a redirect previously
            % took place, so replace requested URI with redirect URI.
            uri = info.uri;
        end
        if ~isempty(info.cookies)
            % If it has cookies, it means we previously received cookies from this host.
            % Add Cookie header field containing all of them.
            request = request.addFields(matlab.net.http.field.CookieField(info.cookies));
        end
    catch
        % no previous redirect or cookies for this host
        info = [];
    end

    % Send request and get response and history of transaction.
    [response, ~, history] = request.send(uri, options);
    if response.StatusCode ~= matlab.net.http.StatusCode.OK
        return
    end

    % Get the Set-Cookie header fields from response message in
    % each history record and save them in the map.
    arrayfun(@addCookies, history)

    % If the last URI in the history is different from the URI sent in the original
    % request, then this was a redirect. Save the new target URI in the host info struct.
    targetURI = history(end).URI;
    if ~isequal(targetURI, uri)
        if isempty(info)
            % no previous info for this host in map, create new one
            infos(char(host)) = struct('cookies',[],'uri',targetURI);
        else
            % change URI in info for this host and put it back in map
            info.uri = targetURI;
            infos(char(host)) = info;
        end
    end

        function addCookies(record)
            % Add cookies in Response message in history record
            % to the map entry for the host to which the request was directed.
            ahost = record.URI.Host; % the host the request was sent to
            cookieFields = record.Response.getFields('Set-Cookie');
            if isempty(cookieFields)
                return
            end
            cookieData = cookieFields.convert(); % get array of Set-Cookie structs
            cookies = [cookieData.Cookie]; % get array of Cookies from all structs
            try
                % If info for this host was already in the map, add its cookies to it.
                ainfo = infos(char(ahost));
                ainfo.cookies = [ainfo.cookies cookies];
                infos(char(ahost)) = ainfo;
            catch
                % Not yet in map, so add new info struct.
                infos(char(ahost)) = struct('cookies',cookies,'uri',[]);
            end
        end
    end
Note that I also return some additional variables from the function:
retInfos is used to get the confirmation code from a cookie.
history is used to obtain a direct link to the file.
Why do we need an additional direct link after we have the confirmation code? Because we want to download multi-frame TIFF files. Matlab tries to be smart and downloads only a single frame by default.
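One way to tell whether a TIFF on disk really holds all of its frames is to count them with imfinfo, which returns one struct per image stored in the file. The snippet below is only a sketch: it builds a made-up three-frame TIFF in a temporary location (the file name and pixel data are assumptions for illustration, not the files from this post) and counts the frames.

```matlab
% Build a small three-frame TIFF in a temporary location (illustration only).
tmpFile = [tempname '.tif'];
frames = uint8(randi(255, 16, 16, 3));
imwrite(frames(:, :, 1), tmpFile);
for k = 2:size(frames, 3)
    % 'append' adds each further frame as a new image in the same file
    imwrite(frames(:, :, k), tmpFile, 'WriteMode', 'append');
end

% imfinfo returns one struct per image stored in the file,
% so the number of elements is the number of frames.
info = imfinfo(tmpFile);
numFrames = numel(info);
fprintf('%s contains %d frame(s)\n', tmpFile, numFrames);

delete(tmpFile);
```

Running the same two imfinfo lines on a downloaded file should show quickly whether only a single frame arrived.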
Let's look at the code that downloads the file and saves it to disk:
    fileName = 'file_00002_00002.tif';
    fileId = '0B649boZqpYG1OEZnV21ncDVNcVk';
    fileUrl = sprintf('https://drive.google.com/uc?export=download&id=%s', fileId);
    request = matlab.net.http.RequestMessage();

    % First request will be redirected to information page about virus scanning
    % We can get a confirmation code from an associated cookie file
    [~, infos] = sendRequest(matlab.net.URI(fileUrl), request);
    confirmCode = '';
    for j = 1:length(infos('drive.google.com').cookies)
        if ~isempty(strfind(infos('drive.google.com').cookies(j).Name, 'download'))
            confirmCode = infos('drive.google.com').cookies(j).Value;
            break;
        end
    end
    newUrl = strcat(fileUrl, sprintf('&confirm=%s', confirmCode));

    % We now need to send another request to get the file.
    % However, Matlab doesn't download the whole Tiff file, but only one frame.
    [~, ~, history] = sendRequest(matlab.net.URI(newUrl), request);

    % Thus we must use log information to find out a
    % direct link and download it as a raw file
    ind = arrayfun(@(x) ~isempty(strfind(x.URI.Host, 'googleusercontent')), history);
    ind = find(ind, 1);

    % we need the raw type in order to download the whole file and not just a single frame
    options = weboptions('ContentType', 'raw');
    imgData = webread(history(ind).URI.EncodedURI, options);
    fid = fopen(fileName, 'wb');
    fwrite(fid, imgData);
    fclose(fid);
Finally, the whole file is saved at the location given by fileName. Note that there are no error checks in the code!
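As a rough sketch of what such checks could look like, the two helpers below fail loudly when no confirmation cookie is found and when the output file cannot be opened or fully written. The names findCookieValue and writeBytes and the error identifiers are hypothetical, not part of Matlab or of the code above, and the demo runs on made-up data so that it needs no network access. Local functions in a script must sit at the end of the file and require R2016b or later.

```matlab
% Demonstrate the helpers on made-up data (no network access needed).
% Real cookies would come from infos('drive.google.com').cookies.
mockCookies = struct('Name',  {'NID', 'download_warning_123'}, ...
                     'Value', {'abc', 'XyZ9'});
code = findCookieValue(mockCookies, 'download');

tmpFile = [tempname '.bin'];
writeBytes(tmpFile, uint8(1:10));
d = dir(tmpFile);
fprintf('confirm code: %s, bytes written: %d\n', code, d.bytes);
delete(tmpFile);

function value = findCookieValue(cookies, pattern)
% Return the Value of the first cookie whose Name contains PATTERN.
% Raises an error instead of silently returning an empty code.
for k = 1:numel(cookies)
    if contains(cookies(k).Name, pattern)
        value = char(cookies(k).Value);
        return
    end
end
error('driveDownload:noConfirmCookie', ...
      'No cookie with "%s" in its name was found.', pattern);
end

function writeBytes(fileName, data)
% Write DATA to FILENAME as raw bytes, checking every step.
[fid, msg] = fopen(fileName, 'w');
if fid == -1
    error('driveDownload:cannotOpen', 'Cannot open %s: %s', fileName, msg);
end
cleaner = onCleanup(@() fclose(fid));   % close the file even if fwrite fails
count = fwrite(fid, data, 'uint8');
if count ~= numel(data)
    error('driveDownload:shortWrite', ...
          'Wrote %d of %d bytes to %s.', count, numel(data), fileName);
end
end
```

The same idea applies to ind in the download script: if find returns an empty result, no googleusercontent redirect was recorded in the history, and indexing history(ind) would fail, so it is worth testing isempty(ind) before calling webread.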
Here is a Gist with the same code.