Data retrieval

The dataset used in this tutorial is available through the NCBI Sequence Read Archive (SRA). To retrieve it, we will use the q2-fondue plugin, which provides programmatic access to sequences and metadata from the SRA. We only need to provide a list of accession IDs; q2-fondue will take care of the rest.

Note

You need to provide an e-mail address (the --p-email parameter) when running the get-all action below. NCBI requires this so that they can contact you in case of any issues.

  • download the files containing all the accession IDs and corresponding metadata:

wget -O ./ids.tsv https://raw.githubusercontent.com/bokulich-lab/moshpit-docs/main/moshpit_docs/data/ids.tsv
wget -O ./metadata.tsv https://raw.githubusercontent.com/bokulich-lab/moshpit-docs/main/moshpit_docs/data/metadata.tsv
  • import the accession ID file (ids.tsv) into a QIIME 2 artifact stored in the cache:

mosh tools cache-import \
    --type 'NCBIAccessionIDs' \
    --input-path ./ids.tsv \
    --cache ./cache \
    --key ids
  • run the get-all action from the fondue plugin:

mosh fondue get-all \
    --i-accession-ids ./cache:ids \
    --p-email YOUR.EMAIL@domain.com \
    --p-n-jobs 5 \
    --p-retries 5 \
    --o-paired-reads ./cache:reads_paired \
    --o-metadata ./cache:metadata \
    --o-single-reads ./cache:reads_single \
    --o-failed-runs ./cache:failed_runs \
    --verbose
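
Because the import and get-all steps both depend on the files downloaded in the first step, it can help to confirm that those downloads succeeded before importing anything. The following is a minimal sketch of such a check. The check_tsv function is a hypothetical helper written for this illustration; it is not part of mosh or q2-fondue, and it only verifies that a file exists and is non-empty, not that its contents are valid accession IDs.

```shell
# Hypothetical helper (not part of mosh or q2-fondue): check that a
# downloaded TSV file exists and is non-empty, then report its line count.
check_tsv() {
    check_tsv_file="$1"

    # -s is true only if the file exists and has a size greater than zero.
    if [ ! -s "$check_tsv_file" ]; then
        echo "ERROR: $check_tsv_file is missing or empty" >&2
        return 1
    fi

    # Count non-blank lines (any header line is included in the count).
    # grep exits non-zero when the count is 0, hence the "|| true".
    check_tsv_count=$(grep -c -v '^[[:space:]]*$' "$check_tsv_file" || true)
    echo "$check_tsv_file: $check_tsv_count non-empty lines"
}
```

After the wget commands have finished, running check_tsv ./ids.tsv && check_tsv ./metadata.tsv should print one line per file. An error here usually means that a download failed and should be repeated before continuing.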