Data retrieval
The dataset used in this tutorial is available through the NCBI Sequence Read Archive (SRA). To retrieve it, we will use the q2-fondue plugin, which provides programmatic access to sequences and metadata from the SRA. We only need to provide a list of accession IDs to download; q2-fondue will take care of the rest.
Note
You need to provide an e-mail address when running the get-all action below. NCBI requires it so that they can contact you in case of any issues.
Download the files containing the accession IDs and the corresponding metadata:
wget -O ./ids.tsv https://raw.githubusercontent.com/bokulich-lab/moshpit-docs/main/moshpit_docs/data/ids.tsv
wget -O ./metadata.tsv https://raw.githubusercontent.com/bokulich-lab/moshpit-docs/main/moshpit_docs/data/metadata.tsv
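Before importing, it can be worth confirming that both downloads produced non-empty tables rather than an error page. The following is a minimal sketch: the helper name `count_rows` is hypothetical (it is not part of MOSHPIT or q2-fondue), and it assumes the first line of each file is a header.

```shell
# Hypothetical helper: count non-empty data rows in a TSV file,
# assuming (not verified here) that the first line is a header.
count_rows() {
    tail -n +2 "$1" | grep -c '[^[:space:]]' || true
}

# Report on the two files downloaded above; complain if one is missing or empty.
for f in ./ids.tsv ./metadata.tsv; do
    if [ -s "$f" ]; then
        echo "$f: $(count_rows "$f") data rows"
    else
        echo "$f is missing or empty" >&2
    fi
done
```

If a file reports zero data rows, re-running the corresponding wget command is usually the first thing to try.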
Import the accession ID file (ids.tsv) into a QIIME 2 artifact, stored in the cache under the key ids:
mosh tools cache-import \
--type 'NCBIAccessionIDs' \
--input-path ./ids.tsv \
--cache ./cache \
--key ids
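A quick way to confirm that the import registered the key is to look for it in the cache directory. This is only a sketch: `cache_has_key` is a hypothetical helper, and it assumes that a QIIME 2 cache stores one entry per key under a `keys/` subdirectory, which is an implementation detail that may change between releases.

```shell
# Hypothetical helper: succeed if <cache>/keys/<key> exists.
# Assumption: the cache keeps one entry per key under its keys/ subdirectory.
cache_has_key() {
    [ -e "$1/keys/$2" ]
}

# Check for the key created by the cache-import command above.
if cache_has_key ./cache ids; then
    echo "key 'ids' found in ./cache"
else
    echo "key 'ids' not found in ./cache" >&2
fi
```

If the key is not found, check the output of the import command for errors before moving on.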
Run the get-all action from the fondue plugin:
mosh fondue get-all \
--i-accession-ids ./cache:ids \
--p-email YOUR.EMAIL@domain.com \
--p-n-jobs 5 \
--p-retries 5 \
--o-paired-reads ./cache:reads_paired \
--o-metadata ./cache:metadata \
--o-single-reads ./cache:reads_single \
--o-failed-runs ./cache:failed_runs \
--verbose
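Since the command above needs a contact address, one option is to keep it in an environment variable rather than hard-coding it in a script. The sketch below assumes a variable called `NCBI_EMAIL`, a name chosen here purely for illustration. The helper `is_plausible_email` is likewise hypothetical and checks only the rough shape of the address, not whether it exists.

```shell
# NCBI_EMAIL is a hypothetical environment variable name used for this sketch.
# The check only verifies the rough shape "something@something.tld".
is_plausible_email() {
    printf '%s\n' "$1" | grep -Eq '^[^@[:space:]]+@[^@[:space:]]+\.[^@[:space:]]+$'
}

if is_plausible_email "${NCBI_EMAIL:-}"; then
    echo "Using contact address: $NCBI_EMAIL"
else
    echo "Set NCBI_EMAIL to a valid address before running get-all" >&2
fi
```

With the variable set, the address can be passed to the action as `--p-email "$NCBI_EMAIL"` in place of the placeholder.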