Snakemake
2025-11-03
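- By default Snakemake looks for a file named "Snakefile" in the working directory (use -s to point it at a different file)
- Can use wildcards for the part of a file name that varies
- Wildcards look like {sample}
- So the input data/clean_reads/Sample1.fastq.gz becomes
- data/clean_reads/{sample}.fastq.gz
- If you don't want to define the samples as a list, must make a config file (or use something like glob_wildcards) to pick up all the files in the folder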
- nano Snakefile
rule clean_reads:
    input:
        "data/raw_reads/Sample1.fastq.gz"
    output:
        "data/clean_reads/Sample1_clean.fastq.gz" #snakemake made this new folder by itself
    conda: #can use conda envs to run the tool
        "Users/ginnyli/python3/envs/snakemake" #conda env here
    shell:
        "fastp -i {input} -o {output}" #the command must be a quoted string
Snakemake has a rule checker: a dry run (-n) walks through the rules without running anything, to see whether they can produce the requested output. Adding -p also prints the shell commands that would be run.
snakemake -np [desired output]
snakemake -np data/clean_reads/Sample1_clean.fastq.gz
snakemake -np data/clean_reads/{Sample1,Sample2,Sample3}_clean.fastq.gz #shell brace expansion (no spaces after the commas) requests a list of targets; the {sample} wildcard is matched for each one
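To see what the wildcard is doing when a target is requested, here is a rough Python sketch of the idea. It is not Snakemake's actual code: the helper names pattern_to_regex and match_wildcards are made up for illustration, and the patterns are the ones from the clean_reads rule in these notes. Each {name} in the output pattern becomes a named regex group, and whatever it captures is filled back into the input pattern.

```python
import re

def pattern_to_regex(pattern):
    """Turn 'dir/{sample}_clean.fastq.gz' into a regex with one named group per wildcard."""
    # re.split with a capturing group alternates: literal, wildcard name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)       # literal text must match exactly
        else:
            regex += f"(?P<{part}>.+)"     # a wildcard matches any non-empty text
    return re.compile(regex)

def match_wildcards(pattern, target):
    """Return the wildcard values if target fits pattern, else None."""
    m = pattern_to_regex(pattern).fullmatch(target)
    return m.groupdict() if m else None

# Patterns taken from the clean_reads rule (hypothetical stand-ins for the real rule)
OUTPUT_PATTERN = "data/clean_reads/{sample}_clean.fastq.gz"
INPUT_PATTERN = "data/raw_reads/{sample}.fastq.gz"

for name in ["Sample1", "Sample2", "Sample3"]:
    target = f"data/clean_reads/{name}_clean.fastq.gz"
    wildcards = match_wildcards(OUTPUT_PATTERN, target)
    # Fill the captured value back into the input pattern
    print(target, "<-", INPUT_PATTERN.format(**wildcards))
```

A target that does not fit the pattern (for example a file under data/stats/) returns None, which is roughly why a dry run complains when no rule can produce the requested file.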
File statistics: just add another rule that runs a seqkit command. Rules chain together when the output of one rule is the input of the next.
rule clean_reads:
    input:
        "data/raw_reads/{sample}.fastq.gz"
    output:
        "data/clean_reads/{sample}_clean.fastq.gz" #snakemake made this new folder by itself
    conda: #can use conda envs to run the tool
        "Users/ginnyli/python3/envs/snakemake" #conda env here
    shell:
        "fastp -i {input} -o {output}"
rule fastqstats:
    input:
        "data/clean_reads/{sample}_clean.fastq.gz"
    output:
        "data/stats/{sample}_clean.txt"
    conda:
        "Users/ginnyli/python3/envs/snakemake"
    shell:
        "seqkit stats {input} > {output}"
rule taxonomy:
    input:
        "data/stats/{sample}_clean.txt"
    output:
        "data/stats/{sample}_taxonomy.txt"
    conda:
        "Users/ginnyli/python3/envs/snakemake"
    shell:
        "whatever the script is"
rule ncbi:
    input:
        "data/stats/{sample}_taxonomy.txt"
    output:
        "data/stats/{sample}_graphical.txt"
    conda:
        "Users/ginnyli/python3/envs/snakemake"
    shell:
        "whatever the script is"
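#rule all goes at the top of the Snakefile so that it is the default target
rule all: #run all the rules together
    input: #list the final outputs here as concrete file names, no wildcards (Sample1 as an example)
        "data/stats/Sample1_graphical.txt"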
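The target list for rule all can be built from whatever is in the raw reads folder instead of being typed by hand. Snakemake has its own helpers for this (glob_wildcards and expand); the sketch below only uses the Python standard library to show the idea. The function names discover_samples and build_targets are made up, and the demo folder is a temporary directory filled with empty placeholder files.

```python
from pathlib import Path
import tempfile

def discover_samples(raw_dir, suffix=".fastq.gz"):
    """Return sorted sample names for every file in raw_dir ending with suffix."""
    return sorted(p.name[: -len(suffix)] for p in Path(raw_dir).glob(f"*{suffix}"))

def build_targets(pattern, samples):
    """Fill the {sample} placeholder once per sample."""
    return [pattern.format(sample=s) for s in samples]

# Demo on a throwaway folder with empty placeholder files (not real reads)
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "data" / "raw_reads"
    raw.mkdir(parents=True)
    for name in ["Sample1", "Sample2", "Sample3"]:
        (raw / f"{name}.fastq.gz").touch()
    samples = discover_samples(raw)
    targets = build_targets("data/stats/{sample}_graphical.txt", samples)

print(samples)
print(targets)
```

The resulting list is what would go under input: in rule all, one final file per sample.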
Can ask to get a graphic visualization of the steps
snakemake [command request here] --dag | dot -Tsvg > dag.svg #dot is part of Graphviz; the flag is -T with a capital T
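The graph comes from walking backward from the requested file: find the rule whose output matches, then ask the same question about that rule's input. A simplified Python sketch of that walk follows, using the four rules from these notes. It assumes one input and one output per rule and a single {sample} wildcard, and it is not how Snakemake is implemented internally. The names RULES, match and plan are made up.

```python
import re

# rule name -> (input pattern, output pattern), copied from the rules in the notes
RULES = {
    "clean_reads": ("data/raw_reads/{sample}.fastq.gz", "data/clean_reads/{sample}_clean.fastq.gz"),
    "fastqstats": ("data/clean_reads/{sample}_clean.fastq.gz", "data/stats/{sample}_clean.txt"),
    "taxonomy": ("data/stats/{sample}_clean.txt", "data/stats/{sample}_taxonomy.txt"),
    "ncbi": ("data/stats/{sample}_taxonomy.txt", "data/stats/{sample}_graphical.txt"),
}

def match(pattern, path):
    """Return the sample name if path fits pattern, else None."""
    # Escape the literal pieces and put a capturing group where {sample} was
    regex = "(?P<sample>.+)".join(re.escape(p) for p in pattern.split("{sample}"))
    m = re.fullmatch(regex, path)
    return m.group("sample") if m else None

def plan(target):
    """Walk backward from target and return rule names in execution order."""
    for name, (inp, out) in RULES.items():
        sample = match(out, target)
        if sample is not None:
            # First plan whatever produces this rule's input, then this rule
            return plan(inp.format(sample=sample)) + [name]
    return []  # no rule makes this file, so it has to exist on disk already

print(" -> ".join(plan("data/stats/Sample1_graphical.txt")))
# prints: clean_reads -> fastqstats -> taxonomy -> ncbi
```

Asking for an intermediate file such as data/stats/Sample1_clean.txt gives a shorter chain, which matches what a dry run shows for that target.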
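Snakemake runs the steps one by one; if a step fails, it stops at that step and lets you know.
- Snakemake will pick up where you left off because it recognizes which outputs already exist and are up to date
- Make sure indentation is consistent throughout the Snakefile; stick to either tabs or spaces and never mix the two
- The purpose of the dry run is to catch errors before anything actually runs!!
- Can check the logs (under .snakemake/log/) for file issues and fix them
- Can give it a number of CPU cores to run independent jobs in parallel, e.g. snakemake --cores 4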
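The "pick up where you left off" behaviour can be pictured as a timestamp check: a step is pending if its output is missing or older than its input. The sketch below shows only that idea in plain Python. Snakemake's own checks cover more than timestamps, and needs_run is a made-up name. The demo sets file times explicitly with os.utime so the result does not depend on the clock.

```python
import os
import tempfile
from pathlib import Path

def needs_run(inputs, outputs):
    """A step is pending if any output is missing or older than the newest input."""
    if not all(os.path.exists(o) for o in outputs):
        return True
    newest_input = max(os.path.getmtime(i) for i in inputs)
    oldest_output = min(os.path.getmtime(o) for o in outputs)
    return newest_input > oldest_output

# Demo with empty placeholder files and hand-set modification times
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "Sample1.fastq.gz"
    clean = Path(tmp) / "Sample1_clean.fastq.gz"
    raw.touch()
    os.utime(raw, (1000, 1000))
    before = needs_run([raw], [clean])     # output does not exist yet
    clean.touch()
    os.utime(clean, (2000, 2000))
    after = needs_run([raw], [clean])      # output is newer than the input
    os.utime(raw, (3000, 3000))
    changed = needs_run([raw], [clean])    # input was modified after the output

print(before, after, changed)
# prints: True False True
```

Steps whose check comes back False are skipped on the next run, so only the failed step and everything downstream of it gets redone.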