Snakemake: checkpoints and wildcards

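I have a snakemake workflow that fails because the last job creates either two output files or none at all. I tried using checkpoints to solve this, but I think I am tripping up on the wildcards when collating the output files in the aggregate function.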

The workflow (1) creates fasta files from biom community profiles. It then (2) runs an in silico PCR on the fasta files, which creates a txt file as output.

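The final step is a parser (3), which outputs a csv and a fasta file. However, if there are no matches in the txt file (i.e. the in silico PCR returned no hits), it creates neither the csv nor the fasta file.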

SAMPLES, = glob_wildcards("input/metaphlan/{sample}.biom")
ID = "0 1 2 3 4".split()

TARGETS = expand("output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta", sample = SAMPLES, id = ID)

rule all:
    input:
        TARGETS

rule getgenome:
    input:
        "input/metaphlan/{sample}.biom"
    output:
        csv="output/metaphlan/fasta_dump/{sample}.csv",
        fas="output/metaphlan/fasta_dump/{sample}_dump.fasta"
    conda:
        "envs/synth_genome.yaml"
    shell:
        "python scripts/get_genomes_noabund_Snakemake.py {input} 1 {output.fas} {output.csv}"

rule PCR:
    input:
        "output/metaphlan/fasta_dump/{sample}_dump.fasta"
    output:
        "output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
    params:
        id = "{id}"
    shell:
        "software/exonerate-2.2.0-x86_64/bin/ipcress --products --mismatch {params.id} scripts/primers-miseq.txt {input} > {output}"

rule parse:
    input:
        "output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
    output:
        "output/metaphlan/isPCR/final/{id}_mismatch_{sample}.csv",
        "output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta"
    shell:
        "python scripts/iPCRess_parser_v2.py {input} {output}"

The dry run is fine, with no errors. But when I do a real run, snakemake aborts and reports that job execution failed:

Waiting at most 5 seconds for missing files.
MissingOutputException in line 31 of snakeflow/Snakefile:
Missing files after 5 seconds:
output/metaphlan/isPCR/final/2_mismatch_metaphlan_rectal_SRR5907487.csv
output/metaphlan/isPCR/final/2_mismatch_metaphlan_rectal_SRR5907487.fasta
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.

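I know I could change the parser script to just create two empty files, but I would rather not create unnecessary files. I looked into dynamic, but that does not work with two potential output files, so I turned to checkpoint. As far as I understand, this should help me get around the problem.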
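
For comparison, the empty-file workaround mentioned here could be sketched roughly as follows. This is only an illustration: `ensure_outputs` is a hypothetical helper, not part of `iPCRess_parser_v2.py`, and it assumes downstream steps can tolerate zero-byte csv and fasta files.

```python
from pathlib import Path


def ensure_outputs(paths):
    """Create an empty placeholder for every expected output that is missing.

    Returns the list of paths that had to be created, so the caller can
    tell real results from placeholders. (Hypothetical helper.)
    """
    created = []
    for p in map(Path, paths):
        if not p.exists():
            # Make sure the target directory exists before touching the file.
            p.parent.mkdir(parents=True, exist_ok=True)
            p.touch()
            created.append(str(p))
    return created
```

Called at the end of the parser with the two declared output paths, this would let the `parse` rule always satisfy its `output:` declaration, at the cost of the extra files I would like to avoid.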

Here is my attempt using a checkpoint:

SAMPLES, = glob_wildcards("input/metaphlan/{sample}.biom")
ID = "0 1 2 3 4".split()

TARGETS = expand("output/metaphlan/isPCR/final/{id}_mismatch_{sample}n.txt", sample = SAMPLES, id = ID)
print(TARGETS)

rule all:
    input:
        TARGETS

rule getgenome:
    input:
        "input/metaphlan/{sample}.biom"
    output:
        csv="output/metaphlan/fasta_dump/{sample}.csv",
        fas="output/metaphlan/fasta_dump/{sample}_dump.fasta"
    conda:
        "envs/synth_genome.yaml"
    shell:
        "python scripts/get_genomes_noabund_Snakemake.py {input} 1 {output.fas} {output.csv}"

rule PCR:
    input:
        "output/metaphlan/fasta_dump/{sample}_dump.fasta"
    output:
        "output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
    params:
        id = "{id}"
    shell:
        "software/exonerate-2.2.0-x86_64/bin/ipcress --products --mismatch {params.id} scripts/primers-miseq.txt {input} > {output}"

checkpoint parse:
    input:
        "output/metaphlan/isPCR/raw/{id}_mismatch/{sample}.txt"
    output:
        "output/metaphlan/isPCR/final/{id}_mismatch_{sample}.csv",
        "output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta"
    shell:
        "python scripts/iPCRess_parser_v2.py {input} {output}"

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.parse.get(**wildcards).output[0,1]
    return expand('output/metaphlan/isPCR/final/{id}_mismatch_{sample}.csv','output/metaphlan/isPCR/final/{id}_mismatch_{sample}.fasta', sample = wildcards.SAMPLES, id=wildcards.ID)

rule collect:
    input:
        aggregate_input
    output:
        "output/metaphlan/isPCR/final/{id}_mismatch_{sample}n.txt"
    shell:
        "cat {input} >> {output}"

And the error:

<function aggregate_input at 0x7f63eade2158>
SyntaxError:
Input and output files have to be specified as strings or lists of strings.
  File "snakeflow/Snakefile", line 52, in <module>

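I believe this is due to the way I am using wildcards in the aggregate function, but I cannot figure it out. I have tried various versions I found in checkpoint tutorials, to no avail.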
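
To make the wildcard issue concrete: inside an input function the wildcards are named after the placeholders in the rule's output pattern (`wildcards.sample`, `wildcards.id`), not after the global lists `SAMPLES` and `ID`, and `expand` takes the pattern (or a list of patterns) as its first argument. A minimal sketch of the file-collecting part in plain Python is below. `existing_parse_outputs` and `FINAL` are hypothetical names, and the commented Snakefile usage assumes the checkpoint declares some output that is always created (for example a directory); otherwise the missing-output error would still occur.

```python
import os

# Pattern mirrors the outputs of the `parse` rule; `ext` is added for illustration.
FINAL = "output/metaphlan/isPCR/final/{id}_mismatch_{sample}.{ext}"


def existing_parse_outputs(sample, mismatch_id, root="."):
    """Return the csv/fasta paths for one (id, sample) pair that exist on disk."""
    candidates = [
        FINAL.format(id=mismatch_id, sample=sample, ext=ext)
        for ext in ("csv", "fasta")
    ]
    return [p for p in candidates if os.path.exists(os.path.join(root, p))]


# Rough idea of how this might be used in a Snakefile (not executed here):
#
# def aggregate_input(wildcards):
#     checkpoints.parse.get(**wildcards)  # wait until the checkpoint has run
#     return existing_parse_outputs(wildcards.sample, wildcards.id)
```

With this shape, `collect` would receive either both files or an empty list for a given `{id}`/`{sample}` pair.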

Any help is much appreciated, thanks!
