Truke, a file conversion tool to check for and handle Excel misidentified gene symbols





Gene symbol conversion to dates check

Output format


Go truke!

Aim

Truke (pronounced ʈɾuke) is a tool to standardize tabular data as used in bioinformatics.

Truke evaluates whether some values are susceptible of being converted into dates by Excel. It is known that Excel and other spreadsheet software may change gene names when importing text data (Zeeberg et al., 2004; Ziemann et al., 2016). A default feature will misidentify specific gene names (e.g.: SET1, AGO3) as dates. This conversion may be avoided by formatting the spreadsheet cells as Text before importing.

Truke is able to rollback safely corrupted data, for instance 02/08/2011 to SEP2. This, however, will only be done for a list of gene symbols for which such transformation can be unambiguously performed. Others, like mar-1 (01/03/2000) will raise a warning as they could refer to either MAR1 or MAR-01, depending on the species.

Truke transforms delimiter-separated text files (spreadsheet-like) by formally specifying the origin and destination field separator (tab, space...), decimal mark (point, comma), missing values (NA, NULL..) and quoting method.

If you find Truke useful please consider citing our paper:

  • Izaskun Mallona, and Miguel A. Peinado (2017). Truke, a web tool to check for and handle excel misidentified gene symbols. BMC Genomics, 18:242. DOI.

Usage

Truke is able to parse the first sheet of XLS and XLSX files, but its gets the most of plain text files, which are recommended (that is, TXT, CSV, TSV and other plain files, read more about text files at Wikipedia). Some of the options described below only apply for text files.

Input

Please upload your XLS, XLSX or text file (the maximum file size is 5 MB). If you don't find your input file in your folder you might need change your file browser settings from 'see all supported files' to 'see all files'. Note that truke only checks the first sheet of Excel spreadsheets.

For text files check, please select the input format by clicking the upper panel radio buttons.

  • Header. A logical variable indicating whether the first line of the file contains the variable names. The header is treated as text no matter what the column content is, text or numeric.

  • Field separator. The character used to specify the limits between columns.

  • Decimal mark. The character used to specify the decimal point symbol (comma or dot).

  • Text quote. Quotes flanking text add flexibility to the variables as they allow to include delimiter characters or other quotes.

  • NA strings. Elements such as ND, NA or NULL will be interpreted as missing values (if clicked). Please note that, if left unclicked, these strings will be treated as text and will coerce the whole column to text, even affecting to numeric values in the same column. If that was the case, the decimal mark and digit grouping methods would not be applied to that column.

  • Digit grouping. Numbers can be formatted in blocks to increase readability. For instance, "10000000" can be written as "10,000,000" or "10 000 000". These digit grouping symbols can be removed by selecting the proper character (i.e. comma or space). Please check whether the decimal mark and digit grouping choices make sense altogether (i.e. they must be different).

Regarding the dynamic table rendered to inspect the data, please note that only the top 10 rows will be displayed, but you will be able to download your whole converted file. For the sake of simplicity some numeric values might be rounded; however, you will be able to download the original data with the original number of digits.

Dates to gene symbol conversion

If kept as Do not transform, Truke won't check for tangled gene symbols. In such a case, you can use the tool to export it as a text file after standardizing decimal symbols, delimiters and so on. Nonetheless, if you uploaded a XLS/XLSX file and selected not to check misidentified gene symbols, Truke will transform cells with dates into text (as yyyy-mm-dd).

Otherwise please select the appropriate date format. This parameter helps to disambiguate dates such as 2010-09-03, which can be read as SEP3 or MAR9, both valid human gene symbols, depending on whether the date format is yyyy-mm-dd or yyyy-dd-mm.

If not sure, you can check for any meaningful date format syntax by selecting Not sure or mixed and, after inspecting the data, choose the proper one. We note that gene symbol to date corruption depends on the Excel language and country settings, so if you are checking your own data we recommend selecting a single date format (the one you default in Excel) to increase the power to rollback converted gene symbols without generating ambiguity tags.

If your data come from a meta-analysis you can check for any meaningful date format syntax by selecting Not sure or mixed.

Output

Once the table preview shows the desired output you can download it by clicking on the Go truke! button.

  • Quote. Whether character columns must be flanked by quotes or not.

  • Decimal mark. The character used to specify the decimal symbol (comma or dot).

  • Field separator. The character used to specify the limits between columns.

Usage tips

  • Gene symbols Truke deals with are updated daily from the gene_info annotation at EBI (ftp parent folder). Current version evaluates the following gene symbols:
  • Selecting Not sure or mixed at Date format will test for any date string following a simple heuristic:
    • Dates are either syllabic items, such as sep-8, or date strings of three numbers separated by slashes or hyphens, such as 02/04/2000.
    • Any four-digit string must be a year, and thus discarded.
    • Months and days have either one or two digits.
    • Years range 0-9999.
    • Months range 1-12.
    • Days range 1-31.
  • The decimal marks and digit grouping choices only affect to numeric columns.
  • Columns with heterogeneous content (that is, numbers and text) will be treated as text. This limitation is intended to prevent unwanted changes in complex data. Allowed exceptions are described in the Input section NA strings.
  • Selection of potentially conflicting combinations of formatting options (e.g. commas for both digit grouping and decimal symbols) may produce odd results.
  • Truke will strip leading and trailing white spaces from unquoted character fields.
  • Truke runs in any modern Web browser. This does not include Internet Explorer 8 and previous versions.

More on Excel gene symbol conversion to dates

Two excellent papers describe the extent of Excel-related data corruption:

  1. Ziemann et al. Genome Biology 2016 17:177
  2. Zeeberg et al. BMC Bioinformatics 2004 5:80

Disclaimer

Truke is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with Truke. If not, see http://www.gnu.org/licenses/.

Source code can be accessed at truke repository at bitbucket.

External links

To our knowledge Truke is the first tool offering structured data reformatting in a decimal mark-aware manner. However, there are third-party CSV converters you might find useful, too:

  • ConvertCSV converts CSV to HTML, JSON, etc. by using client-side javascript, so depending on your computer you might be able to reshape moderately big files (i.e. dozens of MB).
  • Reformattext offers many tools for data editing, such as column formatting, data sorting o duplicate lines cleansing; again, it is based upon javascript run on the client-side.

Contact

Any feedback is welcome! Please contact Izaskun Mallona or Miguel A. Peinado.

Truke was developed at the Miguel A. Peinado's lab. The main focus of our research is the characterization of the molecular mechanisms underlying cancer cell programs and the identification of molecular markers with clinical applications.

Please check more details on the lab and our toolshed at maplab.cat.

We are continuously releasing new epigenetics data visualization and integration tools. You might also like:

  1. Methylation plotter, a web tool for dynamic visualization of DNA methylation data (Source Code for Biology and Medicine 2014, 9:11).
  2. Wanderer, an interactive viewer to explore DNA methylation and gene expression data in human cancer (Epigenetics & Chromatin 2015, 8:22).
  3. The Alu knowledgebase, an ontology of the human Alu repeats. (Journal of Biomedical Informatics 2016, 60:77-83)

maplab 2016. Truke is released without any warranty. Developed by Izaskun Mallona. Logo by Julien Douet.