Truke, a file conversion tool to check for and handle Excel misidentified gene symbols
Gene symbol conversion to dates check
Truke (pronounced ʈɾuke) is a tool to standardize tabular data as used in bioinformatics.
Truke evaluates whether some values are susceptible of being converted into dates by Excel. It is known that Excel and other spreadsheet software may change gene names when importing text data (Zeeberg et al., 2004; Ziemann et al., 2016). A default feature will misidentify specific gene names (e.g.: SET1, AGO3) as dates. This conversion may be avoided by formatting the spreadsheet cells as Text before importing.
Truke is able to rollback safely corrupted data, for instance 02/08/2011 to SEP2. This, however, will only be done for a list of gene symbols for which such transformation can be unambiguously performed. Others, like mar-1 (01/03/2000) will raise a warning as they could refer to either MAR1 or MAR-01, depending on the species.
Truke transforms delimiter-separated text files (spreadsheet-like) by formally specifying the origin and destination field separator (tab, space...), decimal mark (point, comma), missing values (NA, NULL..) and quoting method.
If you find Truke useful please consider citing our paper:
Truke is able to parse the first sheet of XLS and XLSX files, but its gets the most of plain text files, which are recommended (that is, TXT, CSV, TSV and other plain files, read more about text files at Wikipedia). Some of the options described below only apply for text files.
Please upload your XLS, XLSX or text file (the maximum file size is 5 MB). If you don't find your input file in your folder you might need change your file browser settings from 'see all supported files' to 'see all files'. Note that truke only checks the first sheet of Excel spreadsheets.
For text files check, please select the input format by clicking the upper panel radio buttons.
Header. A logical variable indicating whether the first line of the file contains the variable names. The header is treated as text no matter what the column content is, text or numeric.
Field separator. The character used to specify the limits between columns.
Decimal mark. The character used to specify the decimal point symbol (comma or dot).
Text quote. Quotes flanking text add flexibility to the variables as they allow to include delimiter characters or other quotes.
NA strings. Elements such as ND, NA or NULL will be interpreted as missing values (if clicked). Please note that, if left unclicked, these strings will be treated as text and will coerce the whole column to text, even affecting to numeric values in the same column. If that was the case, the decimal mark and digit grouping methods would not be applied to that column.
Digit grouping. Numbers can be formatted in blocks to increase readability. For instance, "10000000" can be written as "10,000,000" or "10 000 000". These digit grouping symbols can be removed by selecting the proper character (i.e. comma or space). Please check whether the decimal mark and digit grouping choices make sense altogether (i.e. they must be different).
Regarding the dynamic table rendered to inspect the data, please note that only the top 10 rows will be displayed, but you will be able to download your whole converted file. For the sake of simplicity some numeric values might be rounded; however, you will be able to download the original data with the original number of digits.
If kept as Do not transform, Truke won't check for tangled gene symbols. In such a case, you can use the tool to export it as a text file after standardizing decimal symbols, delimiters and so on. Nonetheless, if you uploaded a XLS/XLSX file and selected not to check misidentified gene symbols, Truke will transform cells with dates into text (as yyyy-mm-dd).
Otherwise please select the appropriate date format. This parameter helps to disambiguate dates such as 2010-09-03, which can be read as SEP3 or MAR9, both valid human gene symbols, depending on whether the date format is yyyy-mm-dd or yyyy-dd-mm.
If not sure, you can check for any meaningful date format syntax by selecting Not sure or mixed and, after inspecting the data, choose the proper one. We note that gene symbol to date corruption depends on the Excel language and country settings, so if you are checking your own data we recommend selecting a single date format (the one you default in Excel) to increase the power to rollback converted gene symbols without generating ambiguity tags.
If your data come from a meta-analysis you can check for any meaningful date format syntax by selecting Not sure or mixed.
Once the table preview shows the desired output you can download it by clicking on the Go truke! button.
Quote. Whether character columns must be flanked by quotes or not.
Decimal mark. The character used to specify the decimal symbol (comma or dot).
Field separator. The character used to specify the limits between columns.
Two excellent papers describe the extent of Excel-related data corruption:
Truke is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with Truke. If not, see http://www.gnu.org/licenses/.
Source code can be accessed at truke repository at bitbucket.
To our knowledge Truke is the first tool offering structured data reformatting in a decimal mark-aware manner. However, there are third-party CSV converters you might find useful, too:
Any feedback is welcome! Please contact Izaskun Mallona or Miguel A. Peinado.
Truke was developed at the Miguel A. Peinado's lab. The main focus of our research is the characterization of the molecular mechanisms underlying cancer cell programs and the identification of molecular markers with clinical applications.
Please check more details on the lab and our toolshed at maplab.cat.
We are continuously releasing new epigenetics data visualization and integration tools. You might also like:
maplab 2016. Truke is released without any warranty. Developed by Izaskun Mallona. Logo by Julien Douet.