This article by Paul Taylor was published in The Globe and Mail on March 6, 2009:
Plagiarists beware – cybersleuths are on the case
An online program can scan medical databases for cases of copying, helping journal editors ferret out dubious reports
The article is about eTBLAST, a computer program developed by researchers at the University of Texas Southwestern Medical Center in Dallas and available on the Web at eTBLAST.org.
From the Web site:
eTBLAST is best described as a text similarity engine rather than a keyword search engine. For most search engines, such as Google and PubMed, the user must distill their ideas down to a very few keywords, and then try a variety of combinations of them to try to get the most relevant documents. eTBLAST takes a whole paragraph, such as a scientific abstract or, say, an invention description, which usually contains hundreds of keywords, as a query. The user simply pastes in their paragraph into the text box and then submits it to the engine using the “Search” button.
eTBLAST first takes this natural language paragraph, strips it of simple words such as “the, a, of, and” and then it searches its database (Medline, Institute of Physics, US Patent database, etc.) to find those entries that match the maximum number of the remaining keywords, weighted by the frequency of each keyword in all the literature being searched. This is a compute intensive process, but when done it keeps the top 400 ‘hits’ (e.g., Medline abstracts) and then it starts the second phase of the computations. It then does a sentence by sentence alignment, which then accounts for the proximity and order of the words in the query when compared to the abstract ‘hits’. A final similarity score is computed, and then the resulting ‘hits’ are ranked and presented to the user. The ‘hits’ can be viewed in your browser, as a link.
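The two phases described above can be sketched in a few lines of Python. This is only an illustration of the idea, not eTBLAST's actual code: the function names, the short stop-word list, the inverse-document-frequency weighting (one plausible reading of "weighted by the frequency of each keyword"), and the use of the standard library's difflib.SequenceMatcher as a stand-in for the sentence alignment are all assumptions made for the example.

```python
import math
import re
from difflib import SequenceMatcher

# Assumed stop-word list; the site only names "the, a, of, and" as examples.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def keywords(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive sentence splitter: break on whitespace after ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def idf_weights(docs):
    """Assumption: rarer keywords weigh more (inverse document frequency)."""
    df = {}
    for doc in docs:
        for w in set(keywords(doc)):
            df[w] = df.get(w, 0) + 1
    return {w: math.log(1 + len(docs) / count) for w, count in df.items()}


def keyword_score(query, doc, weights):
    """Phase 1: sum the weights of the keywords shared by query and document."""
    shared = set(keywords(query)) & set(keywords(doc))
    return sum(weights.get(w, 0.0) for w in shared)


def alignment_score(query, doc):
    """Phase 2: compare each query sentence with its best-matching document
    sentence. SequenceMatcher is sensitive to word order and proximity."""
    query_sents = [keywords(s) for s in split_sentences(query)]
    doc_sents = [keywords(s) for s in split_sentences(doc)]
    if not query_sents or not doc_sents:
        return 0.0
    best = [
        max(SequenceMatcher(None, q, d).ratio() for d in doc_sents)
        for q in query_sents
    ]
    return sum(best) / len(best)


def search(query, docs, keep=400):
    """Return (document index, similarity) pairs, most similar first."""
    weights = idf_weights(docs)
    ranked = sorted(
        range(len(docs)),
        key=lambda i: keyword_score(query, docs[i], weights),
        reverse=True,
    )
    shortlist = ranked[:keep]  # keep only the top hits for the slower phase
    rescored = [(i, alignment_score(query, docs[i])) for i in shortlist]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

Calling search(pasted_paragraph, list_of_abstracts) returns the shortlisted abstracts ordered by the second-phase score. In this sketch, a document that reuses the query's words in a scrambled order ties with a verbatim copy in the first phase but scores lower in the second, which is the point of adding the alignment step.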