Effective Internet Search: Excerpts
Note: Internal book hyperlinks, except those internal
to this page, have been deactivated below.
Computer programmers will recognize that term variation
operators are a subset of what are known as regular
expressions in programming languages such as Perl. (The
term "regular expression" actually comes from formal
language theory, a branch of mathematics.) In general,
Internet search engines cannot yet handle all the kinds
of term variation that regular expressions can. More on
regular expressions:
ugrad-www.cs.colorado.edu/unix/regex.html
- Wildcards, substrings, stemming [3]:
- Wildcards: Operators that act as placeholders for
yet-to-be-determined characters or groups of characters
in a word.
An asterisk (*) placed at the end of a word matches zero or
more characters, up to an engine-specific limit, in place of
the asterisk.
Thus, in various search engines, the term sound* will match
terms such as sounding, soundproof, sounded, and so on.
The Find commands of word processors offer far more
sophisticated wildcard options than most of today's
Internet search engines.
In certain search engines, the asterisk can be used in the
middle of a word, for, say, any 0 to 3 characters in that
position in the string of characters.
Sp*l will match words
such as spoil, spill, and spool.
In some search engines, a single character, such as the
question mark (?) or the percent (%) character, can be used
as a wildcard that corresponds to an individual character in
a specific place in a word string.
In the string so??d, any
characters can be used in the third and fourth positions.
Possible matches include solid
and sowed.
Typing the asterisk at the end of a word is also known as
stubbing, which is a particular kind of wildcarding. The
word stemming, in its more restrictive sense, may also be
used for this. Truncation is another term for wildcarding.
You can place wildcards anywhere in the search string,
and you can use multiple wildcards in a single word.
Type an asterisk at the start or end of a word fragment to
obtain words that end with or start with the specified
characters, respectively. Thus, the query *man returns
documents containing the words man, woman, Spiderman, Oman,
and so on.
Type ? (question mark) to match a single character in a
given position. Thus, the query car? will return documents
containing words like cart, card, care, and Cary.
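The wildcard behavior described above can be sketched by translating a wildcard term into a regular expression. This is a minimal illustration, not how any particular search engine works: the function names wildcard_to_regex and matches are hypothetical, the asterisk is assumed to stand for zero or more word characters with no upper limit (some engines cap it, as noted above), and the question mark is assumed to stand for exactly one character.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard search term into a compiled regular expression.

    Assumed convention (not taken from any specific engine):
      '*' matches zero or more word characters,
      '?' matches exactly one word character.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            # Escape everything else so it is matched literally.
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def matches(pattern, word):
    """Return True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None


words = ["sounding", "soundproof", "spoil", "spool", "solid",
         "sowed", "woman", "Oman", "cart", "Cary"]
for pattern in ("sound*", "sp*l", "so??d", "*man", "car?"):
    print(pattern, [w for w in words if matches(pattern, w)])
```

Matching is done against whole words and ignores case here; both are design choices of the sketch rather than properties of the engines discussed.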
- Substrings: These behave like wildcards placed before
and/or after a particular part of a word, because the match
is made on a subset of the characters in a word.
The substring oma occurs inside the word woman.
Using wildcards, with each asterisk taking the place of 0 to
n characters, this same substring could be represented as *oma*.
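As a small sketch of this equivalence, the following compares a plain substring test with the wildcard form *oma*. The helper names are hypothetical; the wildcard test uses fnmatchcase from Python's standard fnmatch module, whose asterisk also matches zero or more characters.

```python
from fnmatch import fnmatchcase


def contains_substring(word, fragment):
    """Plain substring test: does fragment occur anywhere inside word?"""
    return fragment in word


def matches_star_wrapped(word, fragment):
    """Same test expressed as a wildcard pattern, e.g. *oma*."""
    return fnmatchcase(word, "*" + fragment + "*")


for word in ["woman", "tomato", "coma", "man"]:
    # Both tests should agree for every word.
    print(word, contains_substring(word, "oma"), matches_star_wrapped(word, "oma"))
```

The two tests agree as long as the fragment consists of ordinary letters; characters such as [ have a special meaning in fnmatch patterns.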
- Stemming: Stemming operators are somewhat similar to
wildcards at the ends of words. In fact, this is how some
search engines appear to define stemming, in which case the
term stubbing is also used.
Of the search engines featured in this book, only MSN
Search offers a word stemming choice on its advanced
search form. Google applies stemming automatically, in
the sense of word stubbing. Stemming can also be
suppressed where desirable.
In a broader sense, however, stemming allows finding
other kinds of variations on the same word, due to
differences in tense or mood, or a word being the verb
equivalent of a particular noun.
When the word think
is entered as the search term, stemming will cause the
search engine to find various connected nouns and verbs,
such as thought, thinker, and thoughtless, as well.
You will get the word flew
when searching for the word fly, along with flies, flying, flight, and so on.
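The broader sense of stemming can be sketched as mapping each word form to a common base form, so that a query and a document match when their base forms agree. The version below is a deliberately naive illustration: the suffix list and the table of irregular forms are hand-made assumptions covering only the examples above, and real engines use far larger linguistic resources.

```python
# Hypothetical lookup table for forms that suffix stripping cannot handle.
IRREGULAR = {
    "thought": "think",
    "flew": "fly",
    "flown": "fly",
    "flies": "fly",
    "flight": "fly",
}

# Hypothetical suffix list, checked in this order.
SUFFIXES = ("less", "ing", "er", "ed", "s")


def stem(word):
    """Reduce a word to a base form using a tiny table plus suffix stripping."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix in SUFFIXES:
        # Keep at least three letters so short words are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    # The stripped form may itself be irregular (thoughtless -> thought -> think).
    return IRREGULAR.get(w, w)


for query in ("think", "fly"):
    related = [w for w in ("thought", "thinker", "thoughtless",
                           "flew", "flies", "flying", "flight")
               if stem(w) == stem(query)]
    print(query, related)
```

Stubbing, by contrast, would only require checking that a word starts with the search term, which is why it finds thinker from think but not thought.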
- Different spellings or phonetic matching [4]:
- Different spellings: The ability to automatically
suggest, and even automatically include, different spellings
of the same word helps to increase the number of
relevant findings in some cases.
Type matherboard
into an appropriate text box, and the Google findings page
will respond with: "Did you mean: motherboard?"
Spelling assistance across different versions of English
(for example, American vs. British) was not observed in
any of the search engines featured in this book.
Convert between American and British English spellings, as
in behavior vs. behaviour, or humor vs. humour.
More on spelling assistance:
www.brightplanet.com/deepcontent/tutorials/Search/part7.asp#topic27
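A "Did you mean" suggestion and the inclusion of alternate spellings can be sketched as follows. This is only an illustration under stated assumptions: the vocabulary and the American/British table are tiny made-up samples, the function names are hypothetical, and the similarity test relies on get_close_matches from Python's standard difflib module rather than on whatever method a real search engine uses.

```python
import difflib

# Hypothetical sample vocabulary of correctly spelled index terms.
VOCABULARY = ["motherboard", "keyboard", "mainboard", "behavior", "humor"]

# Hypothetical table of American -> British spellings.
US_TO_UK = {"behavior": "behaviour", "humor": "humour"}


def did_you_mean(term):
    """Suggest the closest vocabulary word, or None if no suggestion is needed."""
    term = term.lower()
    if term in VOCABULARY:
        return None
    candidates = difflib.get_close_matches(term, VOCABULARY, n=1, cutoff=0.8)
    return candidates[0] if candidates else None


def spelling_variants(term):
    """Return the term together with its American/British counterpart, if any."""
    variants = {term}
    for us, uk in US_TO_UK.items():
        if term in (us, uk):
            variants.update((us, uk))
    return variants


print(did_you_mean("matherboard"))
print(sorted(spelling_variants("humour")))
```

A search front end could show the first result as a suggestion and run the query against every spelling in the second.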
- Phonetic matching: This is matching based on the sound of
a word, as pronounced in some dialect, rather than on its
spelling. The search engines sampled in this book do not
support phonetic matching, except perhaps when it is
connected with spelling correction.
Entering Baylin with phonetic matching will also return
documents containing the like-sounding words Bailin and
Beilin.
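One classic way to implement phonetic matching is the Soundex code, which reduces a name to a letter followed by three digits so that like-sounding names receive the same code. The text above does not say which algorithm any search engine uses; Soundex is chosen here purely as an illustration, and the sketch follows the commonly described American Soundex rules.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w are skipped and do not separate identical codes.
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        # Vowels have no code, so they reset prev and allow a repeat.
        prev = code
    # Pad with zeros and keep the first letter plus three digits.
    return (first.upper() + "".join(encoded) + "000")[:4]


for name in ("Baylin", "Bailin", "Beilin"):
    print(name, soundex(name))
```

All three names receive the code B450, so a search on one of them could be expanded to documents containing the others.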
- Formatting masks [5]: Formats are often used in programs
to cause data to be displayed to the user in a way that
enhances readability. Formatting masks refer to the
"superficial" appearance characteristics of terms.
A North American phone number consists of ten digits, in the
form "(999) 999-9999", where '9' is a placeholder
for any of the digits 0 to 9. In theory, this formatting
mask could be used to select data on the Internet, where
only documents containing a string of text consisting of
exactly ten successive digits formatted with brackets,
space, and dash, as in this example, would be found.
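How such a mask could select text can be sketched by converting the mask into a regular expression, with each '9' standing for one digit and every other character matched literally. The function name mask_to_regex and the sample text are assumptions made for illustration; no search engine discussed here is claimed to work this way.

```python
import re


def mask_to_regex(mask):
    """Convert a formatting mask such as "(999) 999-9999" into a regex.

    '9' is a placeholder for any digit; all other characters are literal.
    """
    parts = [r"\d" if ch == "9" else re.escape(ch) for ch in mask]
    # The lookarounds stop the mask from matching inside a longer digit run.
    return re.compile(r"(?<!\d)" + "".join(parts) + r"(?!\d)")


PHONE = mask_to_regex("(999) 999-9999")

# Hypothetical sample text: only the first number follows the mask.
text = "Call (303) 555-0142 or 303-555-0199."
print(PHONE.findall(text))
```

Only the number written with parentheses, space, and dash is found; the same ten digits in another layout are ignored, which is exactly the selectivity a formatting mask provides.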
Entering a word in quotes, like a phrase, will cause
certain search engines to become case sensitive and thereby
distinguish between uppercase and lowercase letters.
Entering "Idea"
as opposed to idea
will result in matches to documents only when the
"I" is capitalized.
In practical terms, formats are seldom used when searching
for text in document files. They are basically absent from
all search engines, except perhaps to find uppercase letters
when required in certain positions of a search term.
Some search engines also respect the usage of diacritical
marks (symbols placed above or below individual letters in a
word) for letters from certain alphabets (non-English ones, of
course). They use the same characters as in English, but add
diacritical marks (signs, accents, cedillas, etc.) to indicate
different sounds or values of a letter, or to add a particular
vowel before or after a consonant. These marks could be
considered as special letter formats, selected using
"formatting masks."
However, formatting or appearance is sometimes used to find
matches for multi-media files.
Advanced image search interfaces generally allow matches to
be made by image color, background pattern, or screen
resolution (pixel density).
Practice Exercise:
The word satellite must occur with an uppercase "S," as in Satellite.
- AlltheWeb: Case sensitivity is unavailable.
- AltaVista: It is unclear whether case sensitivity is available.
- Copernic: Case sensitivity is available as a checkbox for search within results.
- Google: Case sensitivity is unavailable.
- MSN Search: Case sensitivity is unavailable.
- Ignored words or characters [6]: Often called stop
words, these are words that are ignored when matching terms
to documents. They usually include articles (a, the),
prepositions (at, to, in), various forms of the verb "to be"
(been, is), and other common parts of speech.
More examples: how, which, if, la, de, on, who, where, and
single-letter words.
In addition to stop words, one can refer to "stop
punctuation," or, more generally, to "stop
characters." These are punctuation marks, special
keyboard characters, or numerical digits that are ignored
during the match process.
The colon (:) and the digit in the phrase overview: conclusion 2
may be ignored and treated as if they did not exist. Thus,
the search is really just against the phrase overview conclusion,
without the : and the 2.
Search engines often apply these features automatically and
do not allow you to control them. Many, including the search
engines featured in this book, do not list their stop
characters or stop words either. However, whether a given
word is ignored is easily verified by entering that word as
a search filter.
Some search engines allow you to override the disregard of
stop words by placing a plus sign (+) in front of the stop
word.
+the in Google will cause the search engine to include the
word the when making matches.
If to, in, and the are stop words in a
given search engine, and punctuation characters and digits
are ignored, then the phrase hello world will be
treated as equivalent to the phrase hello to the world, in +/-2020.
This occurs, since the following will all be ignored:
- prepositions: to, in
- article: the
- special characters: comma (,), plus (+), slash (/), and minus (-)
- digits: 2020
Once you remove the above from hello to the world, in +/-2020,
you end up reducing the string to just hello world.
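The reduction just described can be sketched as a small query normalizer. The stop word list (to, in, the), the function name normalize_query, and the handling of a leading plus sign are assumptions made for this example; actual engines apply their own, usually undocumented, lists and rules.

```python
import re

# Assumed stop words for this example only.
STOP_WORDS = frozenset({"to", "in", "the"})


def normalize_query(query, stop_words=STOP_WORDS):
    """Drop stop words, punctuation, and digits from a search phrase.

    A token written with a leading '+' is kept even if it is a stop word.
    """
    kept = []
    for token in query.split():
        forced = token.startswith("+")
        # Remove stop characters: anything that is not a letter.
        word = re.sub(r"[^a-z]", "", token.lower())
        if not word:
            continue  # the token consisted only of punctuation or digits
        if word in stop_words and not forced:
            continue
        kept.append(word)
    return " ".join(kept)


print(normalize_query("hello to the world, in +/-2020"))
print(normalize_query("overview: conclusion 2"))
print(normalize_query("+the world"))
```

The first two calls reduce the phrases to hello world and overview conclusion, while the third keeps the word the because of the plus sign override.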
Internal Book Cross-Links
- Cross-links for this section:
- Reference Section 6.1: further explains how the five featured search engines apply concepts from this section
- Chapter 4: provides a high-level explanation of the search filter entry interfaces
- Cross-links with: Reference Section 6.1: Term variation filters
- Links for wildcards or substrings or stemming:
- Cross-links with: Reference Section 6.1: Different spellings or phonetic matching
- Cross-links with: Reference Section 6.1: Formatting masks
- Cross-links with: Reference Section 6.1: Ignored words or characters