|Title:||SuperQuery 1.52, Discovery Edition © 1997|
|Publication info:||Ann Arbor, MI: MPublishing, University of Michigan Library
This work is protected by copyright and may be linked to without seeking permission. Permission must be received for subsequent distribution in print or electronically. Please contact firstname.lastname@example.org for more information.
vol. 4, no. 3, November 2001
|Article Type:||Software Review|
SuperQuery is a 32-bit datamining program. It requires, at minimum, a 486 processor, Microsoft Windows 95 or NT, 8 MB of RAM, and 8 MB of hard disk space. These are absolute minimum requirements, however. Analysis of large datasets, or analysis requiring many calculations, calls for a faster machine with more memory.
Datamining, the deriving of useful information from raw data, has much to offer historians with large amounts of data – text, numerical, or a combination of both – in spreadsheets or databases. Datamining programs are intended to make it easier to explore and interpret data, and to find patterns in it. Most datamining programs require a background in datamining and, often, a background in statistics. This is not the case with SuperQuery.
The program is easy to install and to use and, in my experience, stable. It adheres to standard Windows conventions, though it does introduce a number of innovative features, such as "floating" data views that can be pinned down with virtual thumbtacks. The Help function in the program is useful, but not as complete as the accompanying documentation. The tutorial is clear, and a quick way to learn the program's functions. SuperQuery provides a wizard to read Microsoft Excel, Access, and FoxPro files, as well as some less commonly used formats. It will also read .txt files, or acquire data via ODBC. I used a very large Excel spreadsheet containing over 5,000 rows and more than 50 columns of information on the Atlantic slave trade. After I defined the columns as either text or numerical in Excel, the wizard worked well and quickly. Once loaded, files are saved in a proprietary format. Annoyingly, on startup the program offers to load the last SuperQuery file worked with, but files with long paths or names are only partially displayed. It is then up to the user to try to remember which file was last used, or to cancel the autoload function and load the desired file through the menus.
The program has four main panes in the primary window that can be resized or closed. The main "Data Table" pane contains all the data in the open file. The "Data Page" pane shows all the information for a single record – a complete row of data. The "Total Page" pane displays user-specified calculations on any column selected in the Data Table pane. A wide array of calculations can be chosen. Possible choices range from the simple – count, maximum and minimum values, mean, and median – to more complex calculations such as the standard deviation, kurtosis, skew, and variance of the data in any given column. Some operations are only relevant for numerical data, while others apply to both numerical and text data. The calculations update automatically as columns are selected. The final pane, the "Reps Graph" pane, automatically graphs the data in the selected column according to most or least common values. This is a useful function, but it is limited. SuperQuery will only draw bar graphs, will not allow further manipulation of the graphs, and cannot graph more than one column at a time. One of the four panes may optionally be replaced with a "Notes" pane that allows users to write comments while working.
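The column totals and the most-common-values graph described above can be approximated outside SuperQuery. What follows is a minimal sketch in Python using only the standard library. The sample values and the function names (`summarize_numeric`, `most_common_values`, `text_bar_graph`) are invented for illustration and say nothing about how SuperQuery itself computes its results. Skew and kurtosis are left out because the standard library does not provide them.

```python
import statistics
from collections import Counter

# Invented example values; they are not taken from the reviewed dataset.
crew_sizes = [18, 22, 25, 22, 30, 41, 22, 18]
ports = ["Liverpool", "Bristol", "Liverpool", "London", "Liverpool", "Bristol"]


def summarize_numeric(values):
    """Return simple per-column totals for a list of numbers."""
    return {
        "count": len(values),
        "minimum": min(values),
        "maximum": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),  # sample variance
        "std_dev": statistics.stdev(values),      # sample standard deviation
    }


def most_common_values(values, n=3):
    """Rank values by how often they occur (works for text or numbers)."""
    return Counter(values).most_common(n)


def text_bar_graph(values, n=3):
    """Render one text bar per value, longest bar first."""
    return [
        f"{value:<10} {'#' * count} ({count})"
        for value, count in most_common_values(values, n)
    ]


if __name__ == "__main__":
    for name, result in summarize_numeric(crew_sizes).items():
        print(name, result)
    print("\n".join(text_bar_graph(ports)))
```

As in the review, the numerical totals only make sense for numerical columns, while the frequency ranking applies equally to text.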
It is possible, with one click, to filter data according to any criterion in the dataset. The function is extremely useful, and works quickly and well. Once a subset of data has been created, it can be duplicated and moved to a separate tab for further analysis. Datasets can be filtered as many times as desired. Filters can be removed one by one, or all together.
A strength of the program is the ease with which "virtual columns" – columns of new data derived from the original data – can be created. They are displayed in cyan to distinguish them from the original data columns. SuperQuery allows the creation of five types of virtual columns.
- Filter columns allow the user to specify a condition against which the data in a column is tested, and return TRUE or FALSE. Multiple conditions can be set using a range of logical operators.
- Range columns allow the user to create text ranges out of numerical data. I used this function to categorize the crew sizes on slaving vessels into "small", "medium", and "large". Ranges can be given any name, and as many as desired can be defined.
- Keyword columns search for any given text in a column, and return a user-defined value.
- Classification columns are similar to filter columns, but allow data spanning multiple columns to be classified. For example, voyages that had a crew of between 20 and 25 members, departed from any English port between 1750 and 1785, and slaved in the Bight of Biafra can be designated "group one", even though the information spans several columns.
- Formula columns allow the user to perform various mathematical, logical, date and other calculations on selected data. I used this, for example, to create a virtual column containing the percentage crew mortality on slaving vessels.
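The five kinds of virtual column listed above each amount to a function that derives a new value from an existing row. The sketch below illustrates that idea in plain Python. It is not SuperQuery's own syntax, which the review does not show. The rows, the field names, the crew-size thresholds, and the list of English ports are all assumptions made up for the example.

```python
# Invented rows; field names are hypothetical, not those of the reviewed dataset.
voyages = [
    {"port": "Liverpool", "year": 1760, "crew": 22, "crew_deaths": 4,
     "region": "Bight of Biafra", "fate": "Shipwrecked after embarkation"},
    {"port": "Bristol", "year": 1740, "crew": 15, "crew_deaths": 3,
     "region": "Gold Coast", "fate": "Completed voyage"},
    {"port": "Nantes", "year": 1770, "crew": 40, "crew_deaths": 10,
     "region": "Bight of Biafra", "fate": "Completed voyage"},
]

ENGLISH_PORTS = {"Liverpool", "Bristol", "London"}  # assumed, partial list


def filter_column(row):
    """Filter column: test conditions joined by logical operators."""
    return row["crew"] > 20 and row["year"] >= 1750


def range_column(row):
    """Range column: map a number onto a named range (thresholds assumed)."""
    if row["crew"] < 20:
        return "small"
    if row["crew"] <= 35:
        return "medium"
    return "large"


def keyword_column(row):
    """Keyword column: return a chosen value when the text is found."""
    return "wreck" if "shipwreck" in row["fate"].lower() else "other"


def classification_column(row):
    """Classification column: one label based on several columns at once."""
    in_group = (
        20 <= row["crew"] <= 25
        and row["port"] in ENGLISH_PORTS
        and 1750 <= row["year"] <= 1785
        and row["region"] == "Bight of Biafra"
    )
    return "group one" if in_group else "unclassified"


def formula_column(row):
    """Formula column: percentage crew mortality, rounded to one decimal."""
    return round(100 * row["crew_deaths"] / row["crew"], 1)


def add_virtual_columns(rows):
    """Return copies of the rows with the five derived columns appended."""
    return [
        {
            **row,
            "large_late": filter_column(row),
            "crew_band": range_column(row),
            "fate_type": keyword_column(row),
            "group": classification_column(row),
            "crew_mortality_pct": formula_column(row),
        }
        for row in rows
    ]


derived = add_virtual_columns(voyages)
```

Because the original rows are copied rather than changed, the derived values stay separate from the source data, much as SuperQuery keeps its virtual columns visually distinct.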
The virtual columns in SuperQuery are an extremely useful means of conceptualizing data in new ways. They form the backbone of any data exploration, categorization and mining endeavor. However, the short names that SuperQuery requires for virtual columns are an annoyance. Similarly, some of the boxes in which the conditions for a given virtual column need to be specified are too cramped, and make it difficult to read long names or formulas.
SuperQuery provides a data analysis wizard, but I found that after using the program for a while, it is easier and quicker to dispense with the wizard and to set up queries by hand. One of the program's most interesting features is the "Fact Discovery Engine". It searches all specified data, including virtual columns, for rules and exceptions. The results are presented in a separate table. I ran the fact discovery engine on a subset of slaving voyages that did not complete their journeys due to natural hazards. The dataset consisted of 463 voyages (rows) with information in 38 fields (columns, original and virtual). By far the majority of rules and exceptions discovered were obvious or uninteresting. For example, the fact engine discovered that, IF the principal place of slave purchase was Bonny, THEN the first region of slave embarkation was the Bight of Biafra (confidence 100%, supported by 22 rows). This was not surprising, as Bonny is in the Bight of Biafra. However, some rules discovered were more interesting. For example, the program discovered that IF a vessel was shipwrecked or destroyed after the embarkation of slaves, or during slaving, THEN slaves perished with the ship (confidence 70%, supported by 129 rows). SuperQuery will present its findings in IF – THEN format, or in fact format – i.e. MOST vessels shipwrecked or destroyed after embarkation of slaves, or during slaving, have slaves perished with the ship (70%, supported by 129 rows). Discovered rules can be analyzed further in the program, or printed out. It is possible to specify that either the IF or the THEN portion of a rule be limited to a single column, or to search for rules in an entire dataset.
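The review does not say how the Fact Discovery Engine finds its rules, so the following is only a generic sketch of single-condition IF-THEN rule mining, not a reconstruction of SuperQuery's method. It assumes the conventional definitions: "support" is the number of rows matching both the IF and the THEN part, and "confidence" is that number divided by the rows matching the IF part. The rows, column names, thresholds, and function names are invented for illustration.

```python
from collections import Counter
from itertools import permutations

# Invented rows; column names are hypothetical.
rows = [
    {"purchase_place": "Bonny", "region": "Bight of Biafra", "outcome": "lost"},
    {"purchase_place": "Bonny", "region": "Bight of Biafra", "outcome": "saved"},
    {"purchase_place": "Calabar", "region": "Bight of Biafra", "outcome": "lost"},
    {"purchase_place": "Anomabu", "region": "Gold Coast", "outcome": "lost"},
    {"purchase_place": "Anomabu", "region": "Gold Coast", "outcome": "saved"},
]


def discover_rules(rows, min_confidence=0.7, min_support=2, then_column=None):
    """Find rules of the form IF col_a = x THEN col_b = y.

    then_column optionally limits the THEN part to a single column.
    """
    columns = list(rows[0].keys())
    rules = []
    for if_col, then_col in permutations(columns, 2):
        if then_column is not None and then_col != then_column:
            continue
        if_counts = Counter(row[if_col] for row in rows)
        pair_counts = Counter((row[if_col], row[then_col]) for row in rows)
        for (if_val, then_val), support in pair_counts.items():
            confidence = support / if_counts[if_val]
            if support >= min_support and confidence >= min_confidence:
                rules.append({
                    "if_column": if_col, "if_value": if_val,
                    "then_column": then_col, "then_value": then_val,
                    "confidence": confidence, "support": support,
                })
    # Strongest rules first: highest confidence, then highest support.
    rules.sort(key=lambda r: (-r["confidence"], -r["support"]))
    return rules


def as_if_then(rule):
    """Format a rule in IF-THEN style."""
    return (
        f"IF {rule['if_column']} = {rule['if_value']} "
        f"THEN {rule['then_column']} = {rule['then_value']} "
        f"(confidence {rule['confidence']:.0%}, supported by {rule['support']} rows)"
    )


def as_fact(rule):
    """Format the same rule in 'fact' style."""
    return (
        f"MOST rows with {rule['if_column']} = {rule['if_value']} "
        f"have {rule['then_column']} = {rule['then_value']} "
        f"({rule['confidence']:.0%}, supported by {rule['support']} rows)"
    )


all_rules = discover_rules(rows)
region_rules = discover_rules(rows, then_column="region")
```

Even on five invented rows, the sketch mostly returns the obvious (a purchase place implies its region), which matches the reviewer's experience that the majority of discovered rules are uninteresting.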
While interesting, the fact discovery engine is not a panacea. The vast majority of facts and exceptions found are not useful. Those that appear interesting require careful further investigation. Double-clicking on a rule will bring up a new view containing the rows on which the rule is based. An examination of the 129 records referred to above, for example, showed that only a fraction of the voyages were known to have lost all of their slaves. Others lost a known portion of their slaves, but for most, it is not known either how many slaves embarked or how many perished. All that is known is that an indeterminate number died as a result of the shipwreck. Occasionally the fact discovery engine finds unexpected nuggets of information in the data, but much depends on how carefully the data are prepared, and on the researcher's skill in creating relevant virtual columns. Even then one must wade through the myriad uninteresting facts that every search returns. I found that for my dataset the fact discovery engine worked best as a means of generating new ideas and questions to ask of my data.
Of how much use is SuperQuery to historians? If one's data can be contained in a spreadsheet, and there is a great deal of it, the program can be very useful. It allows users not specialized in datamining techniques to explore, summarize and analyze large amounts of data, and aids in the discovery of patterns and hidden information contained in a dataset. The cost of many datamining programs is in the thousands of dollars, and by those standards, SuperQuery is reasonably priced. The program has its limitations, but these are balanced by its ease of use and versatility. AZMY Thinkware discovered a niche in the datamining market by realizing that non-professional statisticians also have uses for datamining techniques, and exploited it with the introduction of SuperQuery. However, the product was last updated in 1997, and this does not bode well for the future. Unless the company adapts to innovations in the field, the competition is sure to catch up and surpass it. But in the meantime, it is a program worth considering. An evaluation copy can be downloaded for free at www.azmy.com.