Direct Data Service

The Direct Data Service allows you to download files from the CADC archive using a URL. You can download a file directly from your browser, or automate the download of multiple files from the command line or from within a Python script. If the file is in the FITS format, the Direct Data Service can also retrieve only parts of the file, such as headers, cutouts or single HDUs.

Below, you will find documentation for accessing the CADC Direct Data Service.

In order to use the Direct Data Service, both a CADC Archive Name and a File Identifier are required.

The CADC Direct Data Service URL

The most basic form of the Direct Data Service URL is:

    https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/{ARCHIVE}/{fileID}[OPTIONS]

Example: https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/CFHT/1722795p.fits.fz?fhead=true

Element      Value               Description
{ARCHIVE}    CFHT                Identifies the data archive
{fileID}     1722795p.fits.fz    Identifies the file in the archive
[OPTIONS]    fhead=true          Options added after the file name; in this case, a request for the FITS header
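
The URL pattern above can also be assembled programmatically. The following is a minimal Python sketch, not part of the service itself: the helper name build_data_url is hypothetical, and it assumes that options are passed as an ordinary query string, as in the fhead=true example.

```python
from urllib.parse import urlencode

# Base of the public data endpoint, taken from the URL format above.
BASE_URL = "https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub"


def build_data_url(archive, file_id, **options):
    """Return the Direct Data Service URL for one file (hypothetical helper)."""
    url = f"{BASE_URL}/{archive}/{file_id}"
    if options:
        # Options are appended as a URL-encoded query string.
        url += "?" + urlencode(options)
    return url


print(build_data_url("CFHT", "1722795p.fits.fz", fhead="true"))
```

Called with the values from the table, this prints the example URL shown above.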

Determining the archive name and the file identifier

Typically the Direct Data Service is meant to be used following a request to another CADC service, such as a data search query resulting from the CADC AdvancedSearch service. The result of the search will contain the full URL with the archive name and file IDs, and can be saved in a file.

  • If you know the file identifier ahead of time, for example because you received it from an observatory or can infer it from experience with the service, you can use the Direct Data Service directly. A file identifier is the name the data provider gave the file before uploading it to the CADC archive, so the naming pattern for file IDs depends on the originating archive.

  • The archive names are listed here. An archive name typically represents the name of an observatory or the name of a sky survey.

  • Note: for the FITS format, file names (e.g. 1722795p.fits.fz) and file IDs (e.g. 1722795p) often both work, but this is not always the case.

Browser Usage

If you only need to download one file from a CADC archive, the simplest way is to open your browser and enter the URL as described above in the browser URL bar.

Example: https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/CFHT/1722795p.fits.fz

Command-line Usage

A request to the Direct Data Service can be performed from the command line. Traditional web command-line clients such as wget, curl, or httpie can be used, and CADC provides a more specialized command-line client, cadc-data. We detail their usage below.

Standard command-line wget, curl

wget and curl are standard command-line tools for accessing web services and are often already installed on a computer (Mac and Linux).

  • Example: downloading data from the HLADR2 archive:
      $ wget https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/HLADR2/hst_05476_4r_wfpc2_total_pc_drz.fits.gz
      $ curl -O -J -L https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/HLADR2/hst_05476_4r_wfpc2_total_pc_drz.fits.gz
    For curl to behave like wget, we had to specify the following options:
    • -O -J : saves the file locally (using the server-specified Content-Disposition filename if available, otherwise a filename extracted from the URL) instead of writing it to STDOUT.
    • -L : follows any redirect returned by the server.

If the data you are downloading isn't public, you will need to specify your CADC username and password to access it.

Example:

    $ wget --user=fred --password=passwd123 https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/HLADR2/hst_05476_4r_wfpc2_total_pc_drz.fits.gz
    $ curl -u fred:passwd123 -O -J -L https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/HLADR2/hst_05476_4r_wfpc2_total_pc_drz.fits.gz

Scripting:

wget or curl can also be used in scripts. Both return a non-zero exit status when an error occurs during execution.
Example:

  • Search for M101 on CADC AdvancedSearch on CFHT Megaprime. Mark all images, and click on Download, selecting the URL list in a file, which will download a file under the name cadcUrlList.txt. Then you can run this one-liner command to automatically download all the files listed in the query with 3 concurrent threads:
      $ cat cadcUrlList.txt | xargs -P3 wget --content-disposition

Note: you can also automate the search with the cadctap python package.
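
The same URL list can be processed from Python. The sketch below is one possible approach using only the standard library, under several assumptions: the function names are hypothetical, files are public (no authentication is attempted), and each file is saved under the last component of its URL path rather than the Content-Disposition name that wget --content-disposition would use.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit
from urllib.request import urlopen


def read_url_list(text):
    """Return the non-empty, stripped lines of a URL list."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch(url, directory="."):
    """Download one URL to a file named after the last part of its path."""
    target = os.path.join(directory, os.path.basename(urlsplit(url).path))
    with urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


def download_all(urls, fetch_one=fetch, workers=3):
    """Fetch every URL with a small thread pool.

    Returns a dict mapping each URL to the value returned by fetch_one,
    or to the exception it raised.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {url: pool.submit(fetch_one, url) for url in urls}
    # Leaving the "with" block waits for all submitted downloads to finish.
    results = {}
    for url, future in futures.items():
        error = future.exception()
        results[url] = error if error is not None else future.result()
    return results
```

A typical call would read cadcUrlList.txt, pass its contents to read_url_list, and hand the resulting list to download_all; entries whose value is an exception indicate failed downloads.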

Both commands come with many options; wget --help and curl --help will show them all. We only list some common ones below.

Commonly used options for wget:

  • --user=username --password=password: specify username and password.
  • -nv : non-verbose. wget sends a lot of information to STDOUT. If you are running wget in a script, you want this option.
  • -q: quiet mode.
  • -t, --tries=NUMBER : set number of retries to NUMBER (5 recommended).
  • --waitretry=SECONDS : wait 1..SECONDS between retries of a retrieval. By default, wget will assume a value of 10 seconds.
  • -N, --timestamping : Turn on time-stamping and download only missing or updated files.
  • --content-disposition : Forces wget to give the proper name to the downloaded file.
  • --certificate=file : Use the certificate in file for authentication.

Commonly used options for curl:

  • -O : save the file locally with the same name as the remote version.
  • -J : use the server-specified Content-Disposition filename.
  • -L : follow redirects.
  • -u username:password : specify username and password. If you just specify the username, it will prompt you for your password.
  • -s : make curl run quietly. If you are running curl in a script, you want this option.
  • --retry NUMBER : set number of retries to NUMBER (5 recommended).
  • --data-urlencode : URL-encode data that contains characters not allowed in a URL, useful with cutouts.

CADC Client cadc-data Usage

cadc-data is a software package for accessing the CADC Direct Data Service. It includes a command-line tool of the same name, is written in Python, and can be installed with:

    $ pip install cadcdata

cadc-data command-line:

The command line cadc-data can perform the following actions:

  • retrieve one or more files at a time from an archive.
  • upload files to an archive.
  • discover information about specific files.
  • automatically discover the data service URLs and fail over to another URL if an error occurs transferring a file.
  • automatically retry on errors when a download is interrupted.
  • check that the MD5 checksum of the downloaded file matches the MD5 checksum stored with the CADC to ensure the integrity of the file.
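
The checksum verification in the last item can be reproduced by hand for files downloaded with other tools. This is a minimal sketch, not the code used by cadc-data; the function names are hypothetical, and the expected digest is assumed to be a hexadecimal string such as the md5sum value reported by the service.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Return True if the file's MD5 digest equals the expected hex digest."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use constant, which matters for large FITS files.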

Usage:

    $ cadc-data get {ARCHIVE} {fileID}

Example:

  • Download the file hst_05476_4r_wfpc2_total_pc_drz.fits.gz from the public HLADR2 archive to the current directory:
    $ cadc-data get HLADR2 hst_05476_4r_wfpc2_total_pc_drz.fits.gz

cadc-data commonly used options:

You can adapt cadc-data to your use case with options. Below is a list of some useful options when downloading data.

  • -u, --user=USER : If the data is not public, this option specifies the CADC USER to use for accessing protected data. The command will prompt for your CADC password. Example: The user John Smith with CADC username johnsmith is downloading the protected file hst_05476_4r_wfpc2_total_pc_drz.fits.gz:

      $ cadc-data get --user=johnsmith HLADR2 hst_05476_4r_wfpc2_total_pc_drz.fits.gz
      johnsmith@ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca
      Password: ********

    To avoid being prompted for a password, use the -c or -n option instead.

  • -c, --cert=/path/to/cert : specify the path to an X509 temporary proxy certificate to use for authentication. Get a proxy certificate once and re-use it multiple times, or send it to your trusted collaborators. Example:

      $ cadc-get-cert -u johnsmith
      johnsmith@ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca
      Password: ********
    
      $ cadc-data get --cert ~/.ssl/cadcproxy.pem HLADR2 hst_05476_4r_wfpc2_total_pc_drz.fits.gz
  • -n, --netrc-file=/path/to/netrc : use the legacy .netrc file format for authentication to a web service. The file contains the CADC username and password in clear text, so use it with caution. Its default location is ${HOME}/.netrc. Example:

      $ cadc-data get -n CFHT 700000o.fits.fz
  • --fhead : will download the FITS header information. Example:

      $ cadc-data get -v -n --fhead GEM mrgN20091214S0271_add.fits
  • -z, --decompress: will decompress the data (gzip only) on the fly and save a decompressed version of it.

  • -o, --output=OUTPUT : space-separated list of destination files (quotes required for multiple elements).

  • --cutout [CUTOUT [CUTOUT ...]] : specify one or multiple extension and/or pixel range cutout operations to be performed. Use a minimal cfitsio syntax. Example:

      $ cadc-data get CFHT 700000o --cert ~/.ssl/cadcproxy.pem -o /tmp/700000o-cutout.fits --cutout [1]
  • -q, --quiet: will perform the operation quietly

  • -v, --verbose: will show more messages and a progress bar for downloads.

You can find the full list of options by running cadc-data get --help from a terminal.

cadc-data scripting:

cadc-data can also be used in scripts. It returns a non-zero exit status when an error occurs during execution.
Examples:

  • Download I001B3H0.fits and I016B4H0.fits files from the IRIS archive
    #!/bin/bash
    archive=IRIS
    for file in I001B3H0.fits I016B4H0.fits; do
      echo "getting $archive $file"
      cadc-data get $archive $file && echo "done" || echo "failed"
    done
  • Search for M101 on CADC AdvancedSearch on CFHT Megaprime, download the query result as a TSV file (here the result is saved as result_r140a9bqf8diqk82.tsv), and run this one-liner to automatically download all files listed in the query with 3 concurrent threads:
      $ awk 'NR>1 {print $2,$4}' result_r140a9bqf8diqk82.tsv | xargs -P3 -n2 cadc-data get -v
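
The exit status can be checked from Python as well. The sketch below mirrors the bash loop above using the standard subprocess module; it assumes cadc-data is installed and on the PATH, and the function names are hypothetical.

```python
import subprocess


def get_command(archive, file_id):
    """Build the argument list for 'cadc-data get {ARCHIVE} {fileID}'."""
    return ["cadc-data", "get", archive, file_id]


def fetch_files(archive, file_ids, run=subprocess.call):
    """Run cadc-data once per file and record whether each call succeeded.

    'run' must accept an argument list and return the exit status;
    subprocess.call does exactly that. A zero status means success.
    """
    status = {}
    for file_id in file_ids:
        exit_code = run(get_command(archive, file_id))
        status[file_id] = exit_code == 0
    return status
```

For the IRIS example, fetch_files("IRIS", ["I001B3H0.fits", "I016B4H0.fits"]) returns a dictionary whose False entries mark the files that failed.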

FITS Cutouts Retrieval

If you are dealing with FITS files, and you know you are only interested in small parts of the files, you can limit your retrievals to cutouts. A number of cutout parameters may be included, using a subset of the CFITSIO image section specification for cutout specification. Cutouts need to be encoded with cutout=<value> format in the URL or with --cutout <value> with cadc-data.

  • Some examples of cutout specifications for the <value>:
Value                Explanation
[1:512:2,2:512:2]    Open a 256x256 pixel image consisting of the odd numbered columns (1st axis) and the even numbered rows (2nd axis) of the image in the primary array of the file.
[*,512:256]          Open an image consisting of all the columns in the input image, but only rows 256 through 512. The image will be flipped along the 2nd axis since the starting pixel is greater than the ending pixel.
[*:2,512:256:2]      Same as above but keeping only every other row and column in the input image.
[-*,*]               Copy the entire image, flipping it along the first axis.
[3][1:256,1:256]     Opens a subsection of the image that is in the 3rd extension of the file.
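
To see where sizes such as 256x256 come from, the number of pixels selected along one axis can be worked out from the start, end and step values. The following sketch assumes inclusive, 1-based pixel ranges of the form start:end or start:end:step, as in the examples above; the function name is hypothetical and wildcard ranges (*) are not handled.

```python
def range_length(spec):
    """Number of pixels selected by 'start:end' or 'start:end:step'.

    Both ends are inclusive; a start greater than the end selects the
    same pixels in reverse order.
    """
    parts = [int(part) for part in spec.split(":")]
    start, end = parts[0], parts[1]
    step = parts[2] if len(parts) == 3 else 1
    return abs(end - start) // step + 1


# First example above: odd columns and even rows of a 512x512 image.
print(range_length("1:512:2"), range_length("2:512:2"))
```

For [1:512:2,2:512:2] both axes give 256 pixels, which matches the 256x256 image described in the first row.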

Command-line cutout examples

We provide below some examples using both cadc-data and curl. For curl it is often less error-prone to put the cutout value, which contains the [] brackets, between quotes and to encode it with the --data-urlencode option.

  1. Single FITS extension cutout
     $ cadc-data get CFHT 806045o.fits.fz --output 806045o-cutout1.fits --cutout [1]
     $ curl -L -G -o 806045o-cutout1.fits --data-urlencode "cutout=[1]" https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/CFHT/806045o.fits.fz
  2. Pixel coordinate cutout
     $ cadc-data get CFHTSG D3.IQ.R.fits --output D3.IQ.R.9979_10490_10573_11084.fits --cutout [9979:10490,10573:11084] 
     $ curl -L -G -o D3.IQ.R.9979_10490_10573_11084.fits --data-urlencode "cutout=[9979:10490,10573:11084]" https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/CFHTSG/D3.IQ.R.fits
  3. Extension and pixel coordinate cutout
     $ cadc-data get CFHT 806045o.fits.fz --output 806045o-cutout2.fits --cutout [1][1:100,1:200]
     $ curl -L -G -o 806045o-cutout2.fits --data-urlencode "cutout=[1][1:100,1:200]" https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/CFHT/806045o.fits.fz
  4. Multiple extension cutout
     $ cadc-data get CFHT 806045o --output 806045o-cutout3.fits --cutout [1] [2]
     $ curl -L -G -o 806045o-cutout3.fits --data-urlencode "cutout=[1]" --data-urlencode "cutout=[2]" https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/CFHT/806045o.fits.fz
  5. Multiple extension cutout with pixel coordinates

     $ cadc-data get CFHT 806045o.fits.fz --output 806045o-cutout4.fits --cutout [1][10:120,20:30] [2][10:120,20:30]
     $ curl -L -G -o 806045o-cutout4.fits --data-urlencode "cutout=[1][10:120,20:30]" --data-urlencode "cutout=[2][10:120,20:30]" https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/CFHT/806045o.fits.fz
  6. Alternatively, it is possible to specify a cutout by RA and Dec, using a slightly different service:

     $ curl -L -O -J "https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/caom2ops/sync?id=ad:CFHTSG/D2.I.fits&Circle=150.570478+2.172356+0.01"

    Where the numbers are RA, Dec and size, all in degrees. Remember that a "+" (plus sign) in a URL query means " ", a blank space.
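
The encoding of cutout values and of the RA/Dec query can be checked with Python's standard urllib.parse module. This is an illustrative sketch only: the helper names are hypothetical, and it assumes that multiple cutouts are sent as repeated cutout parameters, as in the examples above.

```python
from urllib.parse import urlencode


def cutout_query(*cutouts):
    """Encode one or more cutout values as repeated 'cutout' parameters."""
    return urlencode([("cutout", value) for value in cutouts])


def circle_query(file_uri, ra, dec, size):
    """Encode the id and Circle parameters of the RA/Dec cutout request.

    Spaces become '+'; ':' and '/' are left as-is so the id stays readable.
    """
    query = {"id": file_uri, "Circle": f"{ra} {dec} {size}"}
    return urlencode(query, safe=":/")


print(cutout_query("[1]", "[2]"))
print(circle_query("ad:CFHTSG/D2.I.fits", "150.570478", "2.172356", "0.01"))
```

The first call shows the brackets percent-encoded as %5B and %5D; the second reproduces the query string of example 6, with the spaces in the Circle value written as plus signs.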

FITS Headers retrieval

Using cadc-data to download a FITS header

cadc-data has a --fhead option for downloading FITS header information:

    $ cadc-data get --fhead IRIS I001B3H0.fits

Using a data service URL to download a FITS header

When requesting a file of type FITS, providing the parameter fhead=true will result in the download of the header information of the file.

Example: view the headers of all extensions of a CFHT image:

    $ curl -L "https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/CFHT/806045o.fits.fz?fhead=true"

The fhead=true and cutout parameters cannot be used together. A potentially useful workaround is to request a cutout of only one pixel of a given extension, e.g. cutout=[1][1:1,1:1].

Advanced Usage of the Direct Data Service

So far only downloads have been covered. The Direct Data Service can also be used to upload data and to get information about a file.

PUT: Uploading files

You may have been given access to a CADC archive to which you can upload data. To upload a file with the data service, you must have permission to write to the target archive.

  • The simplest is probably to use the command-line cadc-data with the syntax:
      $ cadc-data put {ARCHIVE} <file to upload>

An upload is done by performing an HTTP PUT to the URL identifying the file, and supplying the file data in the accompanying input stream of the request. If successful, an HTTP 201 response code will be returned.
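
In Python, such a PUT request could be prepared as follows. This is a minimal sketch using the standard urllib.request module: the helper name, the archive name MYARCHIVE and the file name are hypothetical, authentication is not handled, and the request is only built here, not sent.

```python
from urllib.request import Request

# Public data endpoint from this documentation.
BASE_URL = "https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub"


def prepare_upload(archive, file_id, payload):
    """Build (but do not send) an HTTP PUT request carrying the file bytes."""
    url = f"{BASE_URL}/{archive}/{file_id}"
    return Request(url, data=payload, method="PUT")


# Hypothetical archive and file name, for illustration only.
request = prepare_upload("MYARCHIVE", "example.fits", b"file contents")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; a response status of 201 then indicates success. Because credentials are required to write to an archive, a real upload would need authentication added to this sketch.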

INFO: Retrieving metadata information of archive files

Use the cadc-data info subcommand to retrieve metadata for a file. The available metadata is described below:

Metadata Description
archive: The archive name
encoding: The type of encoding (typically compression) used (optional)
lastmod: Date of the last file modification (optional: not present when modified during delivery)
md5sum: The MD5 digest of the contents of the file.
name: Contains a suggested filename for clients that will write the file
size: Size of the file as delivered
type: The mimetype of the file (optional: only present if type is known)
umd5sum: The MD5 digest of the contents of the file when uncompressed. (optional: not present when modified during delivery)
usize: The size of the uncompressed file, in bytes (optional: not present when modified during delivery)

Example:

    $ cadc-data info IRIS I001B3H0.fits

    File I001B3H0.fit:
        archive: IRIS
       encoding: None
        lastmod: Tue, 25 Jul 2006 23:15:19 GMT
         md5sum: 2ada853a8ae135e16504aeba4e47489e
           name: I001B3H0.fits
           size: 1008000
           type: application/fits
        umd5sum: 2ada853a8ae135e16504aeba4e47489e
          usize: 1008000
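
If the metadata is needed in a script, the text output shown above can be turned into a dictionary. This is a sketch that assumes the "key: value" layout of the example output; the function name is hypothetical, and all values are kept as strings.

```python
def parse_info(text):
    """Parse 'key: value' lines from cadc-data info output into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so dates keep their own colons.
        key, separator, value = line.partition(":")
        if separator and value.strip():
            info[key.strip()] = value.strip()
    return info


sample = """File I001B3H0.fit:
    archive: IRIS
    lastmod: Tue, 25 Jul 2006 23:15:19 GMT
     md5sum: 2ada853a8ae135e16504aeba4e47489e
       size: 1008000
"""
print(parse_info(sample)["lastmod"])
```

The header line ending in a colon has no value and is skipped. For an uncompressed file such as this one, the md5sum and umd5sum entries are identical.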

Programming with the Direct Data Service API

If you want to program with the Direct Data Service API, we also host full documentation of the web service functionality, which we summarize below.

Endpoints

The URL can be changed to access the various functionalities of the service. You can formulate the URL with https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/{endpoint}:

Endpoint                 Description
/data/pub (HTTP)         Public data file transfer resource. /data/pub over HTTP does not gather user credentials, so if downloading a non-public file or uploading to a non-public folder, you will be redirected to /data/auth and challenged for a username/password.
/data/auth               Authenticated data file transfer resource. This resource will challenge for a CADC username/password for authentication and authorization.
/data/pub (SSL)          SSL data file transfer resource. A client certificate must be used to connect to this SSL-based resource. You will be authorized based on the credentials in the certificate.
/data/transfer           Transfer negotiation endpoint for uploads and downloads.
/data/transfer (SSL)     Transfer negotiation endpoint that takes client certificates for authentication and authorization.
/data/auth/transfer      Transfer negotiation endpoint that takes username/password for authentication and authorization.
/data/availability       Resource that can be used to check the availability of the data service. Performing an HTTP GET to this resource will produce an XML document describing the state of the service.

Transfer techniques

  • Direct download: Perform an HTTP GET to /data/pub/{ARCHIVE}/{fileID} and receive a redirect to the preferred download location.
  • Direct upload: Perform an HTTP PUT to /data/pub/{ARCHIVE}/{fileID} and upload directly to the stream.
  • Negotiated download: HTTP POST a transfer document to /data/transfer (or /data/auth/transfer) and receive a transfer document with multiple download locations included.
  • Negotiated upload: HTTP POST a transfer document to /data/transfer (or /data/auth/transfer) and receive a transfer document with multiple upload locations included.

Authentication and Authorization

If you try to access a non-public file, you will be required to authenticate either with a CADC username and password or through a client certificate over SSL. If the authentication (login) fails, you will get an HTTP 401 (Unauthorized) response. If you successfully authenticate but are not allowed to access the file, you will get an HTTP 403 (Forbidden) response. If the file does not exist, you will get an HTTP 404 (Not Found) response.
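
A script can translate these status codes into messages for the user. The sketch below only covers the three codes described above; the function name and the wording of the messages are illustrative.

```python
# Meaning of the error codes described above.
STATUS_MEANINGS = {
    401: "authentication failed: check your CADC username and password",
    403: "authenticated, but not allowed to access this file",
    404: "file not found in the archive",
}


def explain_status(code):
    """Return a short explanation for an HTTP status code from the service."""
    return STATUS_MEANINGS.get(code, f"unexpected HTTP status {code}")


print(explain_status(403))
```

When using urllib.request, these codes arrive as urllib.error.HTTPError exceptions whose code attribute can be passed to explain_status.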

Checking for file availability and access

To check whether a file exists in a CADC archive and whether you have access to it, you can use wget or curl to perform an HTTP HEAD request to the same URL that you would use to download the file. This HEAD request lets you confirm the file's existence and your authorization, and gather basic metadata about the file. To view the HTTP headers with curl, use curl --location --head or curl -L -I. With wget, use wget --server-response --spider. Headers prefixed with X- are custom CADC headers; all others are standard HTTP 1.1 headers.

HTTP Header              Explanation
Content-Type             The mimetype of the file (optional: only present if type is known)
Content-Encoding         The type of encoding (typically compression) used (optional)
Content-Disposition      Contains a suggested filename for clients that will write the file
Content-Length           Size of the file as delivered
Content-MD5              The MD5 digest of the contents of the file
Last-Modified            Date of the last file modification (optional: not present when modified during delivery)
X-Uncompressed-Length    The size of the uncompressed file, in bytes (optional: not present when modified during delivery)
X-Uncompressed-MD5       The MD5 digest of the contents of the file when uncompressed. (optional: not present when modified during delivery)
X-CADC-Stream            The name of the Stream to use when performing a PUT request. (optional: Default Stream is used when none specified.)
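
The same check can be sketched in Python. The example below builds a HEAD request with the standard urllib.request module and extracts a few of the headers listed above; the function names are hypothetical, and redirect handling and authentication are left out.

```python
from email.message import Message
from urllib.request import Request


def head_request(url):
    """Build (but do not send) an HTTP HEAD request for a data URL."""
    return Request(url, method="HEAD")


def summarize(headers):
    """Extract file name, size and MD5 values from response headers.

    'headers' must behave like email.message.Message, as the headers
    attribute of a urllib response does. Missing headers yield None.
    """
    length = headers["Content-Length"]
    return {
        "filename": headers.get_filename(),
        "size": int(length) if length is not None else None,
        "md5": headers["Content-MD5"],
        "uncompressed_md5": headers["X-Uncompressed-MD5"],
    }


# Stand-in headers for illustration; real values come from the service.
example = Message()
example["Content-Disposition"] = 'inline; filename="I001B3H0.fits"'
example["Content-Length"] = "1008000"
print(summarize(example))
```

The optional headers are simply reported as None when the service does not send them.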

Data service and file names

You can use the Content-Disposition HTTP header returned by the data service to have wget write the downloaded file under the name it is stored with in the archive, by using the --content-disposition flag. You may also want to use the no-clobber option (-nc) to avoid overwriting files you have already downloaded. With curl, the -O -J options described above have the same effect.

For URLs which specify a cutout, the suggested filename in the Content-Disposition header will include an extra part so that different cutouts from the same file will have different filenames. This extra part is intended to be somewhat human readable, though many characters are replaced with an underscore (_) to be generally more Internet and file system compatible. This extra part will be consistent between requests with the same cutout parameters.

Contact CADC for Assistance

For help and support with the data service, please email cadc@nrc.ca