Snakemake

2025-11-03

By default, Snakemake looks for a file named "Snakefile" in the working directory
Can use wildcards for file name variations
- Wildcards look like {sample}
- So input data/clean_reads/Sample1.fastq.gz becomes
  - data/clean_reads/{sample}.fastq.gz
  - If you don't want to define the samples as a list, must make a config file to pick up all the files in the folder
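
To make the "pick up all the files in the folder" idea concrete, here is a minimal plain-Python sketch of matching file names against {sample}.fastq.gz. The function name find_samples is made up for illustration. Inside a Snakefile, the built-in glob_wildcards helper does a similar job.

```python
import re
from pathlib import Path


def find_samples(folder, suffix=".fastq.gz"):
    """Return the sorted sample names of all files in `folder` ending in `suffix`."""
    # Same idea as matching each file name against "{sample}.fastq.gz"
    pattern = re.compile(r"(?P<sample>.+)" + re.escape(suffix))
    samples = []
    for path in Path(folder).iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            samples.append(match.group("sample"))
    return sorted(samples)
```

Calling find_samples("data/clean_reads") on a folder holding Sample1.fastq.gz and Sample2.fastq.gz would give the list of the two sample names, which can then fill the {sample} wildcard.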
nano Snakefile

rule clean_reads:
	input:
		"data/raw_reads/Sample1.fastq.gz"
	output:
		"data/clean_reads/Sample1_clean.fastq.gz" #snakemake made this new folder by itself
	conda: #can use conda envs to run the tool 
		"Users/ginnyli/python3/envs/snakemake" #conda env here
	shell:
		"fastp -i {input} -o {output}"

Snakemake has a rule checker. A dry run (-n) plans the jobs without executing anything, and -p prints the shell commands, so you can see whether the requested output can actually be produced

snakemake -np [desired output]
snakemake -np data/clean_reads/Sample1_clean.fastq.gz

snakemake -np data/clean_reads/{Sample1,Sample2,Sample3}_clean.fastq.gz #the shell expands the brace list (no spaces allowed) into three targets, so several samples can be requested at once
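
The brace list in the command above stands for three separate targets. A small Python sketch spells them out. The sample names come from the command itself, and the command is only printed, not run.

```python
samples = ["Sample1", "Sample2", "Sample3"]

# One target path per sample, same layout as in the dry run above
targets = [f"data/clean_reads/{name}_clean.fastq.gz" for name in samples]

# Argument list for the dry run; printed only, not executed here
command = ["snakemake", "-np", *targets]
print(" ".join(command))
```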

File statistics, just add another rule and chain them together to run a seqkit command

rule clean_reads:
	input:
		"data/raw_reads/{sample}.fastq.gz"
	output:
		"data/clean_reads/{sample}_clean.fastq.gz" #snakemake made this new folder by itself
	conda: #can use conda envs to run the tool 
		"Users/ginnyli/python3/envs/snakemake" #conda env here
	shell:
		"fastp -i {input} -o {output}"

rule fastqstats:
	input:
		"data/clean_reads/{sample}_clean.fastq.gz"
	output: 
		"data/stats/{sample}_clean.txt"
	conda:
		"Users/ginnyli/python3/envs/snakemake"
	shell:
		"seqkit stats {input} > {output}"
		
rule taxonomy:
	input: 
		"data/stats/{sample}_clean.txt"
	output: 
		"data/stats/{sample}_taxonomy.txt"
	conda:
		"Users/ginnyli/python3/envs/snakemake"
	shell: 
		"whatever the script is"
	
rule ncbi:
	input: 
		"data/stats/{sample}_taxonomy.txt"
	output: 
		"data/stats/{sample}_graphical.txt"
	conda: 
		"Users/ginnyli/python3/envs/snakemake"
	shell:
		"whatever the script is"
		
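rule all: #run all the rules together by asking for the final outputs; the default target is the first rule in the file, so this usually goes at the top
	input:
		expand("data/stats/{sample}_graphical.txt", sample=["Sample1", "Sample2", "Sample3"]) #sample names taken from the dry run example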

Can ask to get a graphic visualization of the steps

snakemake [command request here] --dag | dot -Tsvg > dag.svg
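
What the graph shows is the chain of rules needed for a requested file. As a rough sketch of that idea (not Snakemake's actual code), the Python below walks backwards from a target by matching it against each rule's output pattern. The RULES table is adapted from the rules above, and pattern_to_regex and plan are made-up helper names.

```python
import re

# (rule name, input pattern, output pattern), adapted from the rules above
RULES = [
    ("clean_reads", "data/raw_reads/{sample}.fastq.gz",
     "data/clean_reads/{sample}_clean.fastq.gz"),
    ("fastqstats", "data/clean_reads/{sample}_clean.fastq.gz",
     "data/stats/{sample}_clean.txt"),
    ("taxonomy", "data/stats/{sample}_clean.txt",
     "data/stats/{sample}_taxonomy.txt"),
    ("ncbi", "data/stats/{sample}_taxonomy.txt",
     "data/stats/{sample}_graphical.txt"),
]


def pattern_to_regex(pattern):
    """Turn a path pattern with a {sample} wildcard into a compiled regex."""
    # Escape the literal parts, then swap the escaped wildcard for a named group
    escaped = re.escape(pattern).replace(r"\{sample\}", r"(?P<sample>.+)")
    return re.compile(escaped)


def plan(target):
    """Return the rule names needed to build `target`, in execution order."""
    steps = []
    while True:
        for name, input_pattern, output_pattern in RULES:
            match = pattern_to_regex(output_pattern).fullmatch(target)
            if match:
                steps.append(name)
                # The matched rule's input becomes the next file to look for
                target = input_pattern.format(sample=match.group("sample"))
                break
        else:
            # No rule produces this file, so it must already exist (raw data)
            break
    return steps[::-1]


print(plan("data/stats/Sample1_graphical.txt"))
```

Asking for the final graphical file pulls in every earlier step, which is why requesting one output is enough to run the whole chain.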

Snakemake runs the steps one by one; if a step fails, it stops at that step and lets you know.

Snakemake will pick up where you left off because it automatically recognizes which outputs already exist
Keep indentation consistent (don't mix tabs and spaces), best use tabs
The purpose of the dry run is to avoid running into errors!!
- can check the logs to find and fix file issues
Can give it a number of CPU cores to run jobs in parallel, e.g. snakemake --cores 4
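
The "pick up where you left off" behaviour can be pictured with a simplified timestamp check: a step is rerun when its output is missing or older than one of its inputs. This is only a sketch with a made-up function name, needs_run. Snakemake itself considers more than file times.

```python
import os


def needs_run(inputs, output):
    """Return True if `output` is missing or older than any of `inputs`."""
    if not os.path.exists(output):
        return True
    output_time = os.path.getmtime(output)
    # Any input modified after the output means the output is out of date
    return any(os.path.getmtime(path) > output_time for path in inputs)
```

With this rule, finished steps whose outputs are up to date are skipped, and only the missing or outdated ones are run again.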

Biology Msc