Full-text search

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references).

In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria (for example, text specified by a user). Full-text-searching techniques appeared in the 1960s (for example, IBM STAIRS in 1969) and became common in online bibliographic databases in the 1990s. Many websites and application programs (such as word processing software) provide full-text-search capabilities. Some web search engines, such as the former AltaVista, employ full-text-search techniques, while others index only a portion of the web pages examined by their indexing systems.[1]

Indexing

When the number of documents is small, a full-text-search engine can scan the contents of the documents directly for each query, a strategy called "serial scanning". Some tools, such as grep, search this way.
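As a minimal sketch of serial scanning, the following Python function reads every document in full for each query and builds no index. The function name `serial_scan` and the sample collection are hypothetical, chosen only for illustration; matching is a plain case-insensitive substring test rather than anything grep-specific.

```python
def serial_scan(documents, query):
    """Return the names of documents whose text contains the query string.

    Every document is read in full for every query; no index is built.
    """
    needle = query.lower()
    return [name for name, text in documents.items() if needle in text.lower()]


# Hypothetical sample collection, used only for illustration.
docs = {
    "a.txt": "The quick brown fox",
    "b.txt": "A lazy dog sleeps",
    "c.txt": "Quick thinking saves the day",
}
hits = serial_scan(docs, "quick")
```

The cost of each query grows with the total size of the collection, which is why this approach suits only small collections or infrequent queries.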

However, when the number of documents to search is potentially large, or the quantity of search queries to perform is substantial, the problem of full-text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms (often called an index, but more correctly named a concordance). In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.[2]
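The two stages can be sketched in Python as follows. This is an illustrative sketch, not a description of any particular engine: the names `tokenize`, `build_index`, and `search` are hypothetical, the tokenizer simply extracts lowercase word characters, and a multi-word query is assumed to require all of its terms.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Indexing stage: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Search stage: consult only the index, never the original text."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())  # assumed semantics: all terms required
    return result


# Hypothetical sample collection.
docs = {1: "Full-text search engines", 2: "Search the index", 3: "Text retrieval"}
index = build_index(docs)
```

Once the index exists, `search(index, "text search")` touches only the term-to-document mapping, so query cost no longer depends on rereading the documents.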

The indexer will make an entry in the index for each term or word found in a document, and possibly note its relative position within the document. Usually the indexer will ignore stop words (such as "the" and "and") that are both common and insufficiently meaningful to be useful in searching. Some indexers also employ language-specific stemming on the words being indexed. For example, the words "drives", "drove", and "driven" will be recorded in the index under the single concept word "drive".
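A minimal sketch of such an indexer is shown below. The stop-word list and the stem lookup table are tiny hypothetical stand-ins: a production system would typically rely on a language-specific stemming or lemmatization component, and irregular forms such as "drove" generally need dictionary knowledge rather than simple suffix rules.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "and", "a", "of"}  # tiny illustrative list, not exhaustive
# Toy lookup table standing in for a real stemmer or lemmatizer.
STEMS = {"drives": "drive", "drove": "drive", "driven": "drive"}


def index_with_positions(documents):
    """Map each normalized term to {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            if word in STOP_WORDS:
                continue  # skipped, but the word still consumes a position
            term = STEMS.get(word, word)
            index[term][doc_id].append(position)
    return index


# Hypothetical sample collection.
docs = {1: "She drives the car", 2: "He drove and she had driven"}
index = index_with_positions(docs)
```

Here all three inflected forms are recorded under "drive", and the stored positions are what later make phrase and proximity queries possible.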

The precision vs. recall tradeoff

Diagram of a low-precision, low-recall search

Recall measures the quantity of relevant results returned by a search, while precision is the measure of the quality of the results returned. Recall is the ratio of relevant results returned to all relevant results. Precision is the ratio of the number of relevant results returned to the total number of results returned.

The diagram represents a low-precision, low-recall search. In the diagram the red and green dots represent the total population of potential search results for a given search. Red dots represent irrelevant results, and green dots represent relevant results. Relevancy is indicated by the proximity of search results to the center of the inner circle. Of all possible results shown, those that were actually returned by the search are shown on a light-blue background. In the example only 1 relevant result of 3 possible relevant results was returned, so the recall is a very low ratio of 1/3, or 33%. The precision for the example is a very low 1/4, or 25%, since only 1 of the 4 results returned was relevant.[3]
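The arithmetic in this example can be checked with a short Python sketch. The function name and the result identifiers are hypothetical; only the counts (3 relevant results, 4 returned, 1 in common) come from the example.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) for sets of returned and relevant result ids."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids reproducing the counts in the example:
# 3 relevant results exist, 4 results are returned, 1 of which is relevant.
relevant = {"g1", "g2", "g3"}
returned = {"g1", "r1", "r2", "r3"}
precision, recall = precision_recall(returned, relevant)
```

With these inputs, precision is 1/4 and recall is 1/3, matching the figures quoted for the diagram.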

Due to the ambiguities of natural language, full-text-search systems typically include options such as filtering to increase precision and stemming to increase recall. Controlled-vocabulary searching also helps alleviate low-precision issues by tagging documents in such a way that ambiguities are eliminated. The trade-off between precision and recall is simple: an increase in precision can lower overall recall, while an increase in recall lowers precision.[4]

False-positive problem

Full-text searching is likely to retrieve many documents that are not relevant to the intended search question. Such documents are called false positives (see Type I error). The retrieval of irrelevant documents is often caused by the inherent ambiguity of natural language. In the sample diagram, false positives are represented by the irrelevant results (red dots) that were returned by the search (on a light-blue background).

Clustering techniques based on Bayesian algorithms can help reduce false positives. For a search term of "bank", clustering can be used to categorize the universe of documents into senses such as "financial institution", "place to sit", and "place to store". Depending on the occurrences of words relevant to each category, a search term or a search result can be placed in one or more of the categories. This technique is extensively deployed in the e-discovery domain.
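The idea of assigning a result to a category based on the words that occur in it can be illustrated with a small naive Bayes classifier. Note the substitution: this is a supervised classifier trained on labeled snippets, a simpler stand-in for the unsupervised clustering described above, and it is not drawn from any e-discovery product. The training snippets, the two category labels used, and the function names are hypothetical, and equal prior probabilities are assumed for the categories.

```python
import math
import re
from collections import Counter

# Hypothetical labeled snippets; a real system would use far more data.
TRAINING = {
    "financial institution": [
        "the bank approved the loan",
        "open a savings account at the bank",
    ],
    "place to store": [
        "the blood bank stores donated plasma",
        "a seed bank stores plant samples",
    ],
}


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def train(training):
    """Count word occurrences per category and collect the vocabulary."""
    counts = {cat: Counter(w for doc in docs for w in tokenize(doc))
              for cat, docs in training.items()}
    vocab = set().union(*counts.values())
    return counts, vocab


def classify(text, counts, vocab):
    """Return the category with the highest Laplace-smoothed log-likelihood."""
    best_cat, best_score = None, -math.inf
    for cat, words in counts.items():
        total = sum(words.values())
        # Words never seen in training are ignored; equal priors are assumed.
        score = sum(math.log((words[w] + 1) / (total + len(vocab)))
                    for w in tokenize(text) if w in vocab)
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat


counts, vocab = train(TRAINING)
```

A search result mentioning "loan" and "account" near "bank" would then be grouped with the financial sense, so a user interested in blood banks could filter it out.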

Performance improvements

The deficiencies of full-text searching have been addressed in two ways: by providing users with tools that enable them to express their search questions more precisely, and by developing new search algorithms that improve retrieval precision.

Improved querying tools

  • Keywords. Document creators (or trained indexers) are asked to supply a list of words that describe the subject of the text, including synonyms of words that describe this subject. Keywords improve recall, particularly if the keyword list includes a search word that is not in the document text.
  • Field-restricted search. Some search engines enable users to limit full-text searches to a particular field within a stored data record, such as "Title" or "Author."
  • Boolean queries. Searches that use Boolean operators (for example, "encyclopedia" AND "online" NOT "Encarta") can dramatically increase the precision of a full-text search. The AND operator says, in effect, "Do not retrieve any document unless it contains both of these terms." The NOT operator says, in effect, "Do not retrieve any document that contains this word." If the search retrieves too few documents, the OR operator can be used to increase recall; consider, for example, "encyclopedia" AND "online" OR "Internet" NOT "Encarta". This search will also retrieve documents about online encyclopedias that use the term "Internet" instead of "online." The increase in precision gained from AND and NOT is often counterproductive, since it usually comes with a dramatic loss of recall.[5]
  • Phrase search. A phrase search matches only those documents that contain a specified phrase, such as "Wikipedia, the free encyclopedia."
  • Concept search. A search that is based on multi-word concepts, for example compound term processing. This type of search is becoming popular in many e-discovery solutions.
  • Concordance search. A concordance search produces an alphabetical list of all principal words that occur in a text with their immediate context.
  • Proximity search. A proximity search matches only those documents that contain two or more words separated by no more than a specified number of words; a search for "Wikipedia" WITHIN2 "free" would retrieve only those documents in which the words "Wikipedia" and "free" occur within two words of each other.
  • Regular expression. A regular expression employs a complex but powerful querying syntax that can be used to specify retrieval conditions with precision.
  • Fuzzy search. A fuzzy search matches documents that contain the given terms or close variations of them (for instance, using an edit-distance threshold to limit how far a variation may differ).
  • Wildcard search. A search that substitutes one or more characters in a search query for a wildcard character such as an asterisk. For example, using the asterisk in a search query "s*n" will find "sin", "son", "sun", etc. in a text.
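Several of the querying tools listed above can be sketched over a toy collection in Python. Everything here is illustrative: the sample documents and function names are hypothetical, the Boolean example adds explicit parentheses because the precedence of the operators in the query above is not specified, and "within N words" is assumed to mean a difference of at most N between word positions.

```python
import fnmatch
import re

# Hypothetical sample collection.
docs = {
    1: "Wikipedia the free encyclopedia online",
    2: "Encarta was an online encyclopedia",
    3: "An encyclopedia on the Internet, free to read like Wikipedia",
}
tokens = {d: re.findall(r"\w+", text.lower()) for d, text in docs.items()}


def containing(term):
    """Set of ids of documents that contain the exact term."""
    return {d for d, words in tokens.items() if term in words}


def within(term_a, term_b, max_gap):
    """Proximity search: positions of the two terms differ by at most max_gap."""
    hits = set()
    for d, words in tokens.items():
        pos_a = [i for i, w in enumerate(words) if w == term_a]
        pos_b = [i for i, w in enumerate(words) if w == term_b]
        if any(abs(i - j) <= max_gap for i in pos_a for j in pos_b):
            hits.add(d)
    return hits


def wildcard(pattern):
    """Wildcard search: ids with any word matching a pattern such as 's*n'."""
    return {d for d, words in tokens.items()
            if any(fnmatch.fnmatchcase(w, pattern) for w in words)}


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fuzzy(term, max_edits=1):
    """Fuzzy search: ids with any word within max_edits edits of the term."""
    return {d for d, words in tokens.items()
            if any(edit_distance(term, w) <= max_edits for w in words)}


# "encyclopedia" AND ("online" OR "internet") NOT "encarta"
boolean_hits = (containing("encyclopedia")
                & (containing("online") | containing("internet"))) - containing("encarta")
```

Boolean operators map directly onto set intersection, union, and difference, while proximity search depends on word positions, which is why many indexers record them.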

Improved search algorithms

The PageRank algorithm developed by Google gives more prominence to documents to which other Web pages have linked.[6] See Search engine for additional examples.
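The link-based idea can be illustrated with a simplified power-iteration sketch in Python. This is a textbook-style approximation, not Google's production algorithm: the function name, the damping factor of 0.85 (a commonly used value, assumed here), the fixed iteration count, and the three-page example web are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to. Every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page divides its damped rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical three-page web: A and B both link to C; C links back to A.
web = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
```

In this toy web, C receives links from two pages and ends up with the highest rank, while B, which nothing links to, ends up with the lowest.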

Software

The following is a partial list of available software products whose predominant purpose is to perform full-text indexing and searching. Some of these are accompanied by detailed descriptions of their theory of operation or internal algorithms, which can provide additional insight into how full-text search may be accomplished.

References

  1. ^ In practice, it may be difficult to determine how a given search engine works. The search algorithms actually employed by web-search services are seldom fully disclosed out of fear that web entrepreneurs will use search engine optimization techniques to improve their prominence in retrieval lists.
  2. ^ "Capabilities of Full Text Search System". Archived from the original on December 23, 2010.
  3. ^ Coles, Michael (2008). Pro Full-Text Search in SQL Server 2008 (Version 1 ed.). Apress Publishing Company. ISBN 978-1-4302-1594-3.
  4. ^ Yuwono, B.; Lee, D. L. (1996). Search and ranking algorithms for locating resources on the World Wide Web. 12th International Conference on Data Engineering (ICDE'96). p. 164.
  5. ^ Experimental Comparison of Schemes for Interpreting Boolean Queries
  6. ^ US 6285999, Page, Lawrence, "Method for node ranking in a linked database", published 1998-01-09, issued 2001-09-04.  "A method assigns importance ranks to nodes in a linked database, such as any database of documents containing citations, the world wide web or any other hypermedia database. The rank assigned to a document is calculated from the ranks of documents citing it. In addition, the rank of a document is..."
  7. ^ "SAP Adds HANA-Based Software Packages to IoT Portfolio | MarTech Advisor". www.martechadvisor.com.
  8. ^ "Vertex AI Search". cloud.google.com/enterprise-search.
