Run `download_ncbi_genome_sequences.py` like so:

```bash
nohup ~/mambaforge/envs/hgt_analyses/bin/python src/download_ncbi_genome_sequences.py -i ../data/1236_subset_taxa.txt > ../data/nohup_download_genome_seqs.out & disown
```

By default, the `-o` output directory is `../data/genome_sequences/`. The script downloads a FASTA file and a GFF file into it for each taxon in the list.

Each of these files is named `{NCBI_taxon_ID}_{NCBI_genome_accession_ID}.{extension}`.
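
As a minimal sketch, a file name of this form can be split back into its parts as shown here. The helper name `parse_genome_filename` and the example file name are hypothetical. The sketch assumes that the taxon ID contains no underscore and that the extension is a single suffix (for example `gff`, not `gff.gz`).

```python
from pathlib import Path


def parse_genome_filename(filename):
    """Split '{taxon_ID}_{accession_ID}.{extension}' into its three parts."""
    path = Path(filename)
    # The extension is everything after the last dot.
    extension = path.suffix.lstrip('.')
    # Split on the first underscore only: the accession ID may itself
    # contain underscores (and a version suffix such as '.1').
    taxon_id, accession_id = path.stem.split('_', 1)
    return taxon_id, accession_id, extension


# Hypothetical file name, for illustration only
print(parse_genome_filename('12345_GCF_000000000.1.gff'))
```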

The TSV file `genome_sequences/1236_subset_accession_ids.tsv` maps each taxon_ID to its accession_ID and records additional information such as the assembly name and the source DB (which is not always RefSeq).
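
A minimal sketch of loading this mapping with the standard library could look like this. It assumes the TSV has a header row. The column names `taxon_id` and `accession_id` are placeholders that are not confirmed by the script, so check the actual header and pass the matching names.

```python
import csv
import io


def load_accession_map(handle, taxon_col='taxon_id', accession_col='accession_id'):
    """Return {taxon ID: accession ID} from a tab-separated file with a header row.

    The default column names are assumptions; adjust them to the real header.
    """
    reader = csv.DictReader(handle, delimiter='\t')
    return {row[taxon_col]: row[accession_col] for row in reader}


# Hypothetical in-memory example; for the real file, pass an open handle
# to the TSV (opened with newline='') instead.
example = io.StringIO(
    'taxon_id\taccession_id\tassembly_name\n'
    '12345\tGCF_000000000.1\tExampleAssembly\n'
)
print(load_accession_map(example))
```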

In [1]:
# import libraries
import itertools

In [4]:
# Next, prepare a multi-sequence FASTA file containing all of these genomes, and a similar file for all of the genes of interest.

# for the genomes, we just need to concatenate all the fasta files
# in the genome_sequences directory, i.e. the output dir of the previous script

# for the genes of interest, we need to first prepare a list of all the gene names.
# for this, first read in the members.tsv file
members_tsv_filepath = '../data/1236_nog_members.tsv'
# read only the 6th column (a comma-separated list of gene names)
with open(members_tsv_filepath) as fo:
    flines = fo.readlines()
    gene_names = [line.split('\t')[5] for line in flines]
# now, split the gene names into a list of lists
gene_names = [gn.split(',') for gn in gene_names]
# flatten this huge list of lists efficiently
gene_names = list(itertools.chain.from_iterable(gene_names))
# remove duplicates
gene_names = list(set(gene_names))
print(f'Found {len(gene_names)} unique gene names. List looks like this:', gene_names[:10])
# write the gene names to a file whose path is the members.tsv path
# with 'members' replaced by 'genes' and 'tsv' replaced by 'list'
gene_names_filepath = members_tsv_filepath.replace('members', 'genes').replace('tsv', 'list')
print(f'Writing gene names to {gene_names_filepath}')
with open(gene_names_filepath, 'w') as fo:
    fo.write('\n'.join(gene_names) + '\n')

Found 156509 unique gene names. List looks like this: ['291331.XOO2297', '225937.HP15_481', '573983.B0681_04825', '318167.Sfri_1100', '291331.XOO1980', '472759.Nhal_3786', '1798287.A3F14_06600', '1354304.XPG1_2379', '28173.VIBNI_B0337', '1461581.BN1049_02975']
Writing gene names to ../data/1236_nog_genes.list
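
The genome concatenation step described in the comments of the cell above can be sketched as follows. This is only one way to do it: the function name, the `*.fasta` glob pattern, and the output path are assumptions and are not taken from the download script.

```python
from pathlib import Path


def concatenate_fasta_files(input_dir, output_path, pattern='*.fasta'):
    """Concatenate all files matching `pattern` in `input_dir` into `output_path`.

    Returns the number of files that were concatenated.
    """
    input_dir = Path(input_dir)
    output_path = Path(output_path)
    # Collect the inputs first, skipping the output file itself in case
    # it lives in the same directory and matches the pattern.
    fasta_paths = sorted(
        p for p in input_dir.glob(pattern)
        if p.resolve() != output_path.resolve()
    )
    with open(output_path, 'wb') as out_fo:
        for fasta_path in fasta_paths:
            data = fasta_path.read_bytes()
            out_fo.write(data)
            # Guard against a missing final newline, which would glue the
            # next header onto the last sequence line.
            if data and not data.endswith(b'\n'):
                out_fo.write(b'\n')
    return len(fasta_paths)
```

For example, `concatenate_fasta_files('../data/genome_sequences/', '../data/all_genomes.fasta')` would write one combined file (the output name is hypothetical). Each input is read fully into memory, which is fine for genomes of a few megabytes but could be replaced by streaming for larger files.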
