Ensemble-based virtual screening of Mycobacterium tuberculosis ClpC1 inhibitors

Tuberculosis (TB) is an airborne transmissible disease caused by Mycobacterium tuberculosis (Mtb), responsible for 1.3 million deaths per year. Due to increased cases of drug resistance to Mtb, a new treatment regime for TB needs to be discovered. ClpC1 plays a crucial role in the protein homeostasis of Mtb thus, presents as a promising target in controlling TB infection. The present study aimed to identify potential inhibitors for the ClpC1 N-terminal domain (ClpC1-NTD) by applying the relaxed complex scheme in virtual screening that accounts for the target and ligand flexibility. A filtered library of natural product compounds was virtually screened against each of the selected ClpC1-NTD dominant conformations from the ensemble generated using molecular dynamics simulation. The promising compounds with the strong binding affinity to ClpC1 protein were then further analysed for their molecular


Introduction
Tuberculosis (TB) is among the deadliest diseases in the world, and as stated in the WHO TB report, it accounts for 10 million new cases annually [1]. The current TB treatments may need to be replaced with novel approaches to tackle the increasing cases of drug resistance to the etiological agent for TB, Mycobacterium tuberculosis (Mtb) [2]. A promising target for anti-mycobacterial drug development is the caseinolytic protein C1 (ClpC1), a protein chaperone encoded by the Mtb clpC1 gene [3]. It plays a pivotal role in Mtb survival and homeostasis by refolding damaged proteins during infection, thus preventing their degradation [4]. The N-terminal domain (NTD) of ClpC1 contains the binding domain that would trigger a proteolytic complex formation, and based on experimental evidence, ClpC1-NTD is essential for the growth and survival of Mtb [5]. Inhibition of ClpC1-NTD would disrupt the formation of the proteolytic complex, resulting in increased protein toxicity levels and causing mycobactericidal effects, significantly reducing the number of Mtb [6]. As ClpC1-NTD has been structurally and functionally characterised, this makes it an attractive target for drug design and discovery. 3 Through experimental means, several natural product compounds have been identified as potential inhibitors for ClpC1-NTD such as cyclomarin A, ecumicin, and rufomycin I. However, cyclomarin A and ecumicin displayed poor pharmacokinetic properties in mice models due to their large size and high flexibility [7]. Experimental identification of ClpC1-NTD inhibitors is expensive nonetheless, recent technological advances have allowed using computational means to ease the drug discovery process. Finding potential inhibitors using in silico analysis comes under computer-aided drug design (CADD) and has proven to be an efficient, cost-, and time-effective method [8,9].
Among the CADD techniques, the relaxed complex scheme allows to computationally visualise the actual dynamics of a protein [10].
This study employed an ensemble-based virtual screening approach for the identification of potential inhibitors for ClpC1-NTD [11]. The protein is not a rigid structure as often treated by most molecular docking tools but is flexible [8]. Therefore, molecular dynamics (MD) simulation was utilised to ensure that the in silico analysis closely mimicked the natural condition where the protein and ligand are flexible when interacting with each other [12]. Investigation on the promising compound stability with the protein was carried out by a subsequent protein-ligand (complex) MD simulation to strengthen the findings.

ClpC1 structure selection and ligand library construction
The crystal structure of Mtb ClpC1-NTD bound to rufomycin I (PDB ID: 6CN8) with a high resolution (1.40 Å) was selected for this in silico study. Subsequently, loop 4 refinement was carried out for the six missing residues and the model (Model 1) with the lowest normalised discrete optimised protein energy (zDOPE) score (-2.24) was chosen. As stated by [13], a zDOPE score value of less than -1 would produce a reliable model. Furthermore, a Ramachandran plot (Supporting Information File 1, Figure S1) was constructed for Model 1 and indicated that 100% residues were residing in the most favoured regions, thus making it a valid model. Simultaneously the ligand library was constructed based on Lipinski's rule of five criteria [14] for small oral drugs (see Supporting Information File 1, Table S1), and 150 natural product (NP) compounds were filtered. Additionally, rufomycin I (PubChem CID: 76871757) was selected as the positive control because it was reported to have a strong binding affinity to ClpC1 protein [7].

Protein ensemble generation MD simulation and clustering
An ensemble of protein conformations was generated over a 100,000 ps MD simulation run instead of using a rigid structure to avoid bias as ligands are known to favour some protein conformations over others and form stable complexes [12,15]. The energy analysis (Supporting Information File 1, Figure S2) showed that energy minimisation of the system was successful, as indicated by the steady convergence of the potential energy curve and the large negative value (-603,660 kJ/mol) obtained. The computational temperature (300 K) and pressure (1 bar) of the system respectively, were maintained over the entire protein MD simulation. The low total energy value (-466,582 kJ/mol) for the MD simulation run was also an indication that the system had reached equilibrium.

5
For the protein trajectory analysis, there were slightly high fluctuations observed at the beginning of the simulation until 20,000 ps in both root mean square deviation (RMSD) and radius of gyration (Rg) plots ( Figures 1A and B respectively). However, the protein structure was compact before 20,000 ps, and the entire trajectory reached an equilibration state after 20,000 ps. As shown in Figure 1D, the visualized structures at specific time intervals indicated that the protein remained stable and compact. The root mean square fluctuation (RMSF) plot ( Figure 1C) provides information on the protein flexibility based on each of the residues [16]. As expected, the terminal loop regions displayed the highest flexibility. In addition, ClpC1 belongs to the AAA+ protein superfamily which is known to be highly dynamic and contains very flexible loop regions [17]. Contrarily, the RMSF values for the alpha-helices remained low throughout the protein MD simulation indicating well-structured regions [18]. Finally, the protein trajectory was clustered using the RMSD-based (GROMOS) method since it would be computationally unfeasible to utilise each of the 10,001 protein conformations (including the starting structure) generated via MD simulation for the molecular docking analysis. The GROMOS method clusters similar conformations together based on the selected RMSD cut-off value [19]. Table 1

Ensemble-based virtual screening analysis and interactions
The ligand library containing 150 NP compounds was virtually screened against each of the three dominant protein conformations independently using the molecular docking protocol for AutoDock 4.2 [20], resulting in 450 initial dockings with 10 search runs each. From the initial VS analysis, the top 10 ligands were selected based on their lowest binding energy values ( Figure 2), as that indicated the optimal binding pose [15,21]. Furthermore, a re-scoring analysis ( Table 2)   Further analysis was carried out to determine the interactions between the selected protein (conformation 1) and ligand (NP132) complex with the lowest binding energy.
The lowest binding energy (more negative) value indicates that the protein and ligand complex is in the lowest energy state and thus closest to the native configuration [21].
Conformation 1 was selected, as it represented 9,094 of the total 10,001 conformations produced. The two protein residues (Arg-10 and Lys-85) were connected to the respective O atoms of NP132 with three hydrogen bonds (H-bonds) ( Table 3 and Figure 3). Experimental evidence suggested that cyclomarin A and ecumicin bind to Lys-85 [7], and ClpS adaptor protein which interacts with ClpC1-NTD, binds to Arg-10 [22]. This demonstrated that compound NP132 bounded to the same side of ClpC1-NTD as the known inhibitors cyclomarin A and ecumicin. Furthermore, the presence of H-bonds between ligand and the active site of protein could be an indication of inhibitory activity [23]. Moreover, Arg-10 and Lys-85 are within a highly conserved region in Mycobacterium species [24]. Conservation of Arg-10 and Lys-85 across bacterial species and H-bond interactions with the same residues suggests that they may be essential for survival and play a role in inhibitory behaviour [25,26]. The bond lengths formed were also significant as shorter H-bond lengths meant stronger bonds [27]. Additionally, H-bonds formation indicated that the ligand remained stable in the binding site of protein [28].  (blue) and NP132 (ball and stick model), generated using UCSF Chimera.

Protein and ligand complex MD simulation
The stability of the protein conformation 1 and ligand NP132 complex was further evaluated during a 100,000 ps MD simulation. The energy and trajectory analyses were analogous to the protein MD simulation, whereby the protein-ligand complex was successfully minimised (-745,200 kJ/mol), the temperature (300 K) and pressure (1 bar) remained constant, and the total energy (-532,200 kJ/mol) indicated that the system had attained equilibrium (Supporting Information File 1, Figure S3).
Furthermore, the MD trajectory showed an overall stable trend for the calculated RMSD value ( Figure 4A) for the ligand atoms (purple). However, for the Carbon-alpha (Cα) backbone of the protein (black) and the protein-ligand complex (orange), RMSD values were ~0.50 nm at 70,000 ps. Since a smaller deviation indicates a more stable structure, the larger RMSD values could be due to the presence of NP132 disrupting the protein structure. Moreover, the Rg value for the protein ( Figure 1B) was ~1.55 nm and for the complex ( Figure 4C) was ~1.61 nm. A higher Rg value for the complex could indicate a lack of compactness [29] which may also be due to the presence of NP132. However, the solvent-accessible surface area (SASA) plot ( Figure 4D) had a value of 91.06 for the complex, which indicated that the structure had remained intact.
Additionally, the ligand did not dislodge from the protein but remained bounded with Hbonds ( Figure 4E). Also, the farthest distance between the ligand and protein was  Table S2). Since NP132 followed Lipinski's rule of five, this indicates that the compound will have high solubility, permeability, and bioavailability [30]. Contrarily, rufomycin I violated three of Lipinski's rule of five as the molecular weight was greater than 500 Dalton, lipophilicity (logP) was also greater than 5, and there were more than 5 H-bond donors. In addition, NP132 was non-toxic, and rufomycin I exhibited low toxicity in humans. For a compound to have good oral bioavailability, following at least two Lipinski's rule of five is required [31]. Furthermore, for an oral drug, the uptake of a compound through the intestines is essential [32]. The absorption, distribution, metabolism, excretion, and toxicity (ADMET) profile (Table 4) indicated that the gastrointestinal absorption rate for the potent compound NP132 was high as opposed to rufomycin I, which could be attributed to its larger size. As stated by [33], the larger the size (molecular weight) of a compound, the lower the absorption rate will be. The results further showed that NP132 and rufomycin I would not permeate the blood-brain barrier. This physiological barrier prevents the therapeutic compounds in blood from crossing into the brain, leading to lower brain toxicity [34]. The CYP3A4 inhibition was also studied as this is the most abundant of the Cytochrome P450 (CYP) enzymes and is involved in the metabolism of many drugs [35]. Since NP132 was a non-inhibitor, it could be metabolised by the liver [36]. On the contrary, rufomycin I inhibited CYP3A4 probably due to its high logP value (5.1774) and large molecular weight (1012.20 Dalton) (Supporting Information File 1, Table S2), which may cause severe side effects [37]. Additionally, the excretion rates of the compounds from the body were determined. Where the 0.773 log ml/min/kg and -0.081 log ml/min/kg values for a total clearance of NP132 and rufomycin I respectively meant that NP132 would be excreted quickly by the kidneys, while rufomycin I would have a slower excretion rate. Lastly, AMES toxicity was used to test for mutagenicity, and both NP132 and rufomycin I were non-mutagenic. Therefore, they might have a high probability of passing the toxicity tests in vitro and in vivo [38].

Conclusion
Drug resistance to TB has made most of the available treatments inefficient, which has led to a search for novel treatment options. One of the targets being studied currently is the chaperone protein ClpC1 involved in protein homeostasis and is essential for

Experimental Data collection and optimisation
A crystallographic structure (PDB ID: 6CN8) was selected from Protein Data Bank based on its high resolution. This structure was pre-processed using UCSF Chimera 1.13.1 [39] where the crystallised water, ligand, and ions were removed. Then the missing residues in the loop region were refined using MODELLER 9.21 [40], and the model with the lowest zDOPE score was chosen from the seven models that were generated. This model was validated through PROCHECK server [41] to evaluate the stereochemical properties of the modelled protein. Furthermore, the ligands were obtained from Super Natural II, a natural product compounds database. The Lipinski's rule of five criteria which constitutes that the molecular weight of compounds is less than 500 Dalton, logP is not greater than 5, there are not more than 10 H-bond acceptors, and 5 H-bond donors [14] was applied to extract the ligands used in this study. The library also included a positive control (PubChem CID: 76871757) extracted from PubChem database.

Protein MD simulation
The modelled protein structure was subjected to an MD simulation using GROMACS 5.0.4 [42] on an Intel ® Core ™ i7-4930K machine with 96 Gib memory (RAM) equipped with GeForce GTX780 Ti as the graphics processing unit (GPU) and running on Ubuntu Linux package. This stage included the computational generation of protein ensemble and the selection of dominant conformations in the protein trajectory [11].

Conformational ensemble generation
According to a study by [43], the TIP3P water model is the best fit for the CHARMM force field. Therefore, during protein MD simulation the CHARMM36 force field [44] was used for topology generation. Then, the system was solvated using the TIP3P water model [45] in a cubic box with a minimum distance of 1 nm. Next, two Na + ions were added to the system to neutralise the total net charge of -2 and followed by energy minimisation (steepest descent method for 50,000 steps) to remove steric clashes and inappropriate geometry [18]. The minimised system was subjected to the number of particles, volume, and temperature (NVT) equilibration run for 200 ps under constant temperature condition, and the number of particles, pressure, and temperature (NPT) equilibration run for 500 ps under constant pressure condition.
Finally, the production run was conducted for a 100,000 ps time scale under constant temperature (300 K) and pressure (1 bar) using the Langevin thermostat and piston methods [46] respectively. The time step for the MD simulation was 2 fs, and the particle mesh Ewald (PME) method [47] was used to analyse the long-range electrostatic interactions. The 100,000 ps production run was carried out to monitor the conformational changes of protein and generated snapshots every 10 ps interval, which resulted in 10,001 structures, including the initial structure. The graphs for MD simulation results were generated using modules in GROMACS and GRACE 5.1.22 program [48].

RMSD-based clustering
The protein trajectory was clustered based on RMSD using the GROMOS algorithm in GROMACS. A set of general guidelines was followed for determining the best RMSD cut-off value, listed below: 1. The number of total clusters should be less than 40.
2. There should not be many clusters with only one member.
3. More than 90% of the trajectory should be represented in less than seven clusters.
The RMSD cut-off value 1.70 Å, which fulfilled the above criteria was chosen and the representative conformations were separated for subsequent analyses.

Ensemble-based virtual screening
The ligand library was screened against each of the selected protein conformations using AutoDock 4.2 [20]. The respective charges and hydrogens were added to prepare the protein conformations and ligands for molecular docking using AutoDockTools [20]. For each conformation, the grid box with dimensions 70×70×70 with spacing 0.375 Å, was set up on the binding site. Then grid box was centred on coordinates x = -2.442, y = 6.599, and z = -6.208. The dockings were carried out using a Lamarckian genetic algorithm (LGA) [49] with 10 search runs for initial virtual screening and 100 search runs for the re-scoring analysis. Parameters including population size, number of energy evaluations and generations, and mutation and crossover rates were left at default. The ligand with the strongest binding affinity with the protein was selected for further analysis.

Complex MD simulation
To determine the stability of the selected protein-ligand complex, another MD simulation was carried out. The complex MD simulation protocol was similar to the protein MD simulation carried out previously. The exception in the protocol is on the ligand topology, which was generated using an external tool SwissParam [50] and later combined with protein topology. Then the system was subjected to solvation, neutralisation with three Na + ions, and energy minimisation. The NVT equilibration run duration was the same however, the NPT equilibration run was conducted for 1,000 ps under constant pressure condition instead of 500 ps. Production run and analysis of complex MD trajectory involved similar procedures.

ADMET profile generation
To investigate the behaviour of predicted potent compound and the positive control inside the human host, the in silico ADMET profile was generated using SwissADME [51] and pkCSM [52] web-based applications. The simplified molecular-input line-entry system (SMILES) was used as the input for SwissADME and pkCSM. These quick, accurate, and easy-to-use web servers provided information regarding the gastrointestinal absorption rate, blood-brain barrier permeability, inhibition of CYP3A4 (an important CYP isoform), drug clearance from the body, and mutagenicity of the uploaded compounds.

Supporting Information
Supporting Information File 1: File Name: Additional results data File Format: docx