Data File Storage (Aurora)

Aurora

Aurora, EMSL’s scientific data archive, is a dedicated computer system specifically designed for long-term storage of data collected by EMSL instruments. It is available at no cost to EMSL users who are part of an active EMSL user project, and follows EMSL’s Data Management policies.

Aurora is safe

  • Designed to hold data safely and securely on a long-term basis

  • Stores multiple copies of data to protect from data loss in case of media failure

  • Expertly maintained by skilled information systems professionals from EMSL’s Molecular Sciences Computing Group

  • Monitored continuously for optimum performance

Aurora is free

  • No cost

  • Petabytes of available storage

Aurora is easy to use

  • Research groups own the data

  • Researchers have full control to share files as needed to best serve their research projects

  • Simple, straightforward access is available through Windows file sharing, FTP, and SCP protocols

About the Aurora File System (/archive)

Aurora is based on IBM’s High Performance Storage System (HPSS) with customizations specific to EMSL. The EMSL customizations provide simple access methods for users who are not accustomed to HPSS.

HPSS uses a combination of high-performance disk storage and high-capacity tape storage, which balances performance against cost for high-capacity storage systems. New files are written to disk first and copied to tape shortly thereafter. After a period of time, the disk copy is deleted to save space, leaving the data on tape only.

Aurora Policies

Aurora is intended for storage of important data that has long-term intrinsic value or data that is too costly or difficult to reproduce. Although no restrictions are placed on the format of data archived, the archive is never to be used to store:

  • Backups of PCs and workstations

  • Classified information

  • Data for which the storage is regulated or required for regulatory compliance

  • Data where human health and safety depend on the access and accuracy of the data.

The following uses are permitted only with prior approval of the Molecular Sciences Computing Operations Capability Lead:

  • Storing of scratch files and intermediate results (only when such data is needed for ongoing research and the cost of recreating the data is prohibitive)

  • Storing of data produced for or by non-EMSL work.

Policy Violations

Users who store data in violation of these policies may have the data removed and their archive access privileges suspended. Such users will be notified before any of their data is deleted.

Data Access

The data owner controls access to the data. By default, only the owner has any permissions on files and directories. The owner can grant access permissions to other users so that files can be shared among individuals or within a work group. This access can include any combination of read, write, and delete permissions. Projects can also own data; in this case, project members share access to the project’s data. As with any computer system, the administrators of the archive can always access data on the system. They will do so only for error recovery or when requested by authorized staff members.

Permissions are initially set when the user account or project account is created. To change those permissions, or to add users to or remove users from a project account, the account owner should contact the Aurora Support Queue.

Transfer of Ownership

If users leave EMSL or projects within EMSL, any data they have placed in the archive is kept and can be transferred to the ownership of another staff member (such as a project manager, principal investigator, or scientist). The new owner then has complete control of the data and access controls. Even if a new owner is not immediately assigned, the data will not automatically be removed.

Getting an Aurora Account

All EMSL users who are part of an active EMSL user proposal are allowed access to Aurora. Request an account by contacting your PI (or by requesting an account using IOPS if you are an internal user).

If you will need more than 500 GB of space for storing your files, include this information in your request.
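To judge whether a data set is likely to exceed that figure before you submit the request, you can measure it with du. The sketch below is only illustrative: the helper name needs_extra_space and the example path are hypothetical, and it assumes 500 GB is counted as 500 × 1024 × 1024 KiB, which may not match how Aurora accounts for space.

```shell
#!/bin/sh
# needs_extra_space DIR [LIMIT_KIB]
# Succeeds (exit status 0) when DIR occupies more than LIMIT_KIB kibibytes.
# The default limit is an assumption: 500 GB taken as 500 * 1024 * 1024 KiB.
needs_extra_space() {
    dir=$1
    limit_kib=${2:-$((500 * 1024 * 1024))}
    # du -sk prints the total size of DIR in 1024-byte units.
    used_kib=$(du -sk "$dir" | awk '{print $1}')
    [ "$used_kib" -gt "$limit_kib" ]
}

# Example (hypothetical path):
#   needs_extra_space /home/username/mydata && echo "mention this in the request"
```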

How to Transfer Files to Aurora

For users outside of PNNL, use one of these transfer methods with your SecurID credentials:

  • SCP to aurora.emsl.pnl.gov.

  • Globus endpoint emsl#archive.

For users inside PNNL, use one of these transfer methods with your Kerberos/PNL domain credentials:

  • SCP to aurora.emsl.pnl.gov.

  • CIFS (Windows mount) at aurora.emsl.pnl.gov:/archive.

  • HSI transfer client installed on EMSL compute resources.
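As a rough illustration of the SCP route, the sketch below builds an upload command for a directory. Only the host name aurora.emsl.pnl.gov comes from the list above. The helper name aurora_scp_put, the DRY_RUN switch, and the assumption that your remote directory is /archive/<username>/ are illustrative, so confirm the remote path for your account before relying on it.

```shell
#!/bin/sh
# aurora_scp_put SRC USER DEST
# Prints the scp command by default; set DRY_RUN=0 to actually run it.
# Assumption: the remote layout /archive/<USER>/<DEST> is illustrative only.
aurora_scp_put() {
    src=$1
    user=$2
    dest=$3
    # Assemble the command as positional parameters so quoting is preserved.
    set -- scp -r "$src" "${user}@aurora.emsl.pnl.gov:/archive/${user}/${dest}"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Example (hypothetical local directory and username):
#   aurora_scp_put ./results d30000 results
```

Printing the command first makes it easy to check the destination path before any data moves.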

Transferring files with HTAR

HTAR will move your data into Aurora and TAR it up on the fly. Simple instructions:

Storing data:

/msc/bin/htar -cvf /path/to/destination/file.tar /path/to/source/directory

The -cvf options tell HTAR to create (c) an archive, report each incoming file verbosely (v), and use the file name (f) that follows as the TAR file to create. The last argument is the directory or list of files you want to archive.

HTAR will prompt you for a “principal” (this is your username) and then your password. It then begins creating the TAR file.

Retrieving data:

/msc/bin/htar -xvf /path/to/file.tar somedata

This retrieves and extracts the file “somedata” from file.tar. Omit this argument to extract all files.

For more information, see Lawrence Livermore National Lab’s great guide to HTAR usage: https://hpc.llnl.gov/manuals/ezstorage/htar-examples
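After storing data, it can be useful to list the archive's table of contents to confirm that everything arrived. The sketch below combines the -cvf call shown above with HTAR's -tvf listing mode. The helper name htar_store_and_list and the overridable HTAR variable are illustrative conveniences, not part of HTAR, and the paths in the example are hypothetical.

```shell
#!/bin/sh
# HTAR defaults to the path given above; override it to try the helper
# on a machine without HPSS access.
HTAR=${HTAR:-/msc/bin/htar}

# htar_store_and_list ARCHIVE SOURCE...
# Creates ARCHIVE from the SOURCE paths, then lists its table of contents
# so you can confirm every file is present.
htar_store_and_list() {
    archive=$1
    shift
    "$HTAR" -cvf "$archive" "$@" && "$HTAR" -tvf "$archive"
}

# Example (hypothetical paths):
#   htar_store_and_list /path/to/destination/file.tar /path/to/source/directory
```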

Transferring files with HSI

HSI is available on EMSL compute resources at /msc/bin/hsi. HSI is significantly faster than all other transfer methods and automatically handles files greater than 2 TB. HSI operates similarly to a command-line FTP client. Example:

-bash-4.2$ /msc/bin/hsi
Principal: username
Password:
Username: username  UID: 21321  Acct: 0(0) Copies: 1 COS: 0 Firewall: off [hsi.6.0.0.p11 Thu Jan 31 14:15:58 PST 2019]
? lcd /home/username/mydata
? lls
file1  file2  file3
? cd /archive/username/tmp
? put file1 file2 file3
put  'file2' : '/username/tmp/file2' ( 0 bytes, 0.0 KBS (cos=1))
put  'file1' : '/username/tmp/file1' ( 0 bytes, 0.0 KBS (cos=1))
put  'file3' : '/username/tmp/file3' ( 0 bytes, 0.0 KBS (cos=1))
?

Notes:

  • lcd: change local directory

  • cd: change remote directory. Note the leading /archive is unnecessary.

  • put: store a file. You can type “put” by itself to see syntax and available options.

  • put -R: recursively store a directory.

  • help: get a full list of available HSI commands.
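The session above stores files. Retrieval works the same way with get, and HSI also accepts a quoted, semicolon-separated command string on its command line, which is convenient in scripts. The sketch below assumes that one-line mode behaves on EMSL systems as it does in standard HSI. The helper name hsi_get and the overridable HSI variable are hypothetical, and HSI may still prompt for your principal and password.

```shell
#!/bin/sh
# HSI defaults to the path given above; override it to try the helper
# on a machine without HPSS access.
HSI=${HSI:-/msc/bin/hsi}

# hsi_get REMOTE_DIR FILE...
# Runs one non-interactive HSI session that changes to REMOTE_DIR and
# retrieves the named files into the current local directory.
hsi_get() {
    remote_dir=$1
    shift
    "$HSI" "cd $remote_dir; get $*"
}

# Example, mirroring the session above:
#   hsi_get /username/tmp file1 file2 file3
```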

Simplified Access

Aurora’s data is mounted at /archive on the Tahoma and Cascade login nodes. If you have a directory of your own under /archive/ (for example, /archive/d30000 for the username d30000), you might see:

[d30000@cu0login3 ~]$ df -h /archive
Filesystem      Size  Used Avail Use% Mounted on
fuse            3.9P  1.4P  2.5P  36% /archive

[d30000@cu0login3 ~]$ ls /archive/d30000
/archive/d30000/bjobsinfo.output  /archive/d30000/random-junk.out
/archive/d30000/n0.tar            /archive/d30000/xc_v321_x86_64_0.dvdiso

You can copy files to and from your archive directory mostly as you would expect from a normal file system. However, if you copy particularly large files into your archive directory, or particularly old files out of your archive directory, you may find those operations to be slow.

Performance Tips

Recently created or recently accessed files in the archive are likely to be stored on disk, and access to them will be fast. Files that are particularly large or have not been accessed in some time may only be stored on tape. In this case, it may take up to a minute to access the file contents because Aurora’s tape robot must automatically retrieve the correct tape and begin transferring the data.

Important

Avoid searching for data across large numbers of files while they are in the archive, because tape access times will make such a process slow. Instead, retrieve copies of the files to a local file system with good performance and perform the operation(s) there.
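A minimal sketch of that copy-then-search pattern follows. The helper name search_staged and the example paths are hypothetical. It uses grep -r, which is widely available but not required by POSIX.

```shell
#!/bin/sh
# search_staged PATTERN ARCHIVE_DIR STAGE_DIR
# Copies the contents of ARCHIVE_DIR to STAGE_DIR (a fast local file system),
# then lists the staged files that contain PATTERN.
search_staged() {
    pattern=$1
    archive_dir=$2
    stage_dir=$3
    mkdir -p "$stage_dir" || return 1
    # "DIR/." copies the directory's contents rather than the directory itself.
    cp -R "$archive_dir"/. "$stage_dir"/ || return 1
    grep -rl -- "$pattern" "$stage_dir"
}

# Example (hypothetical paths):
#   search_staged "converged" /archive/d30000/run42 /tmp/run42-staged
```

Each file is read from the archive once during the copy, so any repeated searches run entirely against the local copy.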

When storing data in Aurora, it is important to store fewer, larger files rather than many smaller files. Users are encouraged to use utilities such as ZIP and TAR to bundle their data before transferring it to Aurora for long-term storage.
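For transfer methods that do not bundle files for you, one way to do this is to pack a directory into a single TAR file locally before copying it. The sketch below uses the standard tar utility. The helper name bundle_for_archive and the example paths are hypothetical.

```shell
#!/bin/sh
# bundle_for_archive SRC_DIR OUT_TAR
# Packs SRC_DIR into a single tar file, storing paths relative to its parent
# so the archive unpacks into one top-level directory.
bundle_for_archive() {
    src_dir=$1
    out_tar=$2
    tar -cf "$out_tar" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}

# Example (hypothetical paths):
#   bundle_for_archive /home/username/mydata /home/username/mydata.tar
```

The resulting file can then be transferred as one object instead of many small ones.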

Science Contact - Ryan Wright
