Text datatypes

22 October 2013

Characters

A character is a unit of information that roughly corresponds to a grapheme, grapheme-like unit, or symbol, such as in an alphabet or syllabary in the written form of a natural language.

Examples of characters include letters, numerical digits, common punctuation marks (such as "." or "-"), and whitespace. The concept also includes control characters, which do not correspond to symbols in a particular natural language, but rather to other bits of information used to process text in one or more languages. Examples of control characters include carriage return or tab, as well as instructions to printers or other devices that display or otherwise process text.

Character encoding

Computers and communication equipment represent characters using a character encoding that assigns each character to something, typically an integer quantity represented by a sequence of bits, that can be stored or transmitted through a network. Two common examples of encodings are ASCII and the UTF-8 encoding for Unicode.
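The mapping from characters to integers and then to bytes can be illustrated with a minimal Python sketch. The sample text and the variable names are chosen here for illustration only; `ord` and `str.encode` are standard Python built-ins.

```python
# Map characters to integer code points, then to bytes under two encodings.
text = "A\u00e9"  # the letter A followed by e with an acute accent (U+00E9)

code_points = [ord(ch) for ch in text]  # the integer assigned to each character
ascii_bytes = "A".encode("ascii")       # ASCII uses one byte per character
utf8_bytes = text.encode("utf-8")       # UTF-8 uses one to four bytes per character

print(code_points)        # [65, 233]
print(list(ascii_bytes))  # [65]
print(list(utf8_bytes))   # [65, 195, 169]
```

Note that the accented letter cannot be encoded in ASCII at all, while UTF-8 stores it as two bytes.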

Note on terminology

Historically, the term character has been widely used by industry professionals to refer to an encoded character, often as defined by the programming language or API. Likewise, character set has been widely used to refer to a specific repertoire of characters that have been mapped to specific bit sequences or numerical codes. The term glyph is used to describe a particular visual appearance of a character. Many computer fonts consist of glyphs that are indexed by the numerical code of the corresponding character.

char data type

Many languages have a char type. In the C programming language, a char is a data type whose size is exactly one byte, which in turn is defined to be large enough to contain any member of the basic execution character set and UTF-8 code units. This implies a minimum size of 8 bits. C++ defines char in the same way as C. Other languages, such as Java, use 16 bits for char in order to represent UTF-16 code units.
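The difference between a one-byte char and a 16-bit code unit can be explored from Python as a rough sketch. Here `ctypes.c_char` stands in for the C char type, and the helper `utf16_code_units` is a hypothetical name introduced for this example.

```python
import ctypes

# ctypes.c_char mirrors the C char type, whose size is one byte by definition.
char_size = ctypes.sizeof(ctypes.c_char)

def utf16_code_units(text):
    """Return how many 16-bit code units are needed to encode text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

bmp = utf16_code_units("A")              # U+0041 fits in one 16-bit unit
astral = utf16_code_units("\U0001F600")  # U+1F600 needs two units (a surrogate pair)

print(char_size, bmp, astral)  # 1 1 2
```

The last value shows that a single 16-bit char is not always enough to hold one character.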

Characters are typically combined into strings.

Strings

[Figure: illustration of a string]
A string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and/or the length changed, or it may be fixed (after creation). A string is generally understood as a data type and is often implemented as an array of bytes (or words) that stores a sequence of elements, typically characters, using some character encoding.

Depending on the programming language and the precise data type used, a variable declared to be a string may either cause storage in memory to be statically allocated for a predetermined maximum length or employ dynamic allocation to allow it to hold a variable number of elements.

When a string appears literally in source code, it is known as a string literal and has a representation that denotes it as such.

String concatenation and substrings

Concatenation is an important operation on strings and generally means connecting two strings together.

A substring is a string that is a part of another string, e.g. abc is a substring of abcdef.
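Both operations can be shown in a short Python sketch, using the abc and abcdef strings from the text. The variable names are illustrative only.

```python
first = "abc"
second = "def"

joined = first + second          # concatenation connects the two strings
contains = "abc" in joined       # True when the left string is a substring
middle = joined[2:5]             # slicing extracts the substring at positions 2..4
position = joined.find(middle)   # index where the substring starts, or -1

print(joined, contains, middle, position)  # abcdef True cde 2
```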

Prefixes and suffixes

A string s is said to be a prefix of t if there exists a string u such that t = su. If u is nonempty, s is said to be a proper prefix of t. Symmetrically, a string s is said to be a suffix of t if there exists a string u such that t = us. If u is nonempty, s is said to be a proper suffix of t. Suffixes and prefixes are substrings of t.
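The definitions above translate almost directly into code. The following Python sketch uses hypothetical helper names; `str.startswith` and `str.endswith` are standard string methods.

```python
def is_prefix(s, t):
    """True if t = s + u for some string u (u may be empty)."""
    return t.startswith(s)

def is_proper_prefix(s, t):
    """True if t = s + u for some nonempty string u."""
    return t.startswith(s) and len(s) < len(t)

def is_suffix(s, t):
    """True if t = u + s for some string u (u may be empty)."""
    return t.endswith(s)

def is_proper_suffix(s, t):
    """True if t = u + s for some nonempty string u."""
    return t.endswith(s) and len(s) < len(t)

def all_prefixes(t):
    """List every prefix of t, from the empty string up to t itself."""
    return [t[:i] for i in range(len(t) + 1)]

print(is_prefix("abc", "abcdef"))            # True
print(is_proper_prefix("abcdef", "abcdef"))  # False, because u would be empty
print(is_suffix("def", "abcdef"))            # True
print(all_prefixes("abc"))                   # ['', 'a', 'ab', 'abc']
```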

String datatypes

A string datatype is a datatype modeled on the idea of a formal string. Strings are such an important and useful datatype that they are implemented in nearly every programming language. In some languages they are available as primitive types and in others as composite types. The syntax of most high-level programming languages allows for a string, usually quoted in some way, to represent an instance of a string datatype; such a meta-string is called a literal or string literal.

String length

Although formal strings can have an arbitrary (but finite) length, the length of strings in real languages is often constrained to an artificial maximum. In general, there are two types of string datatypes: fixed-length strings, which have a fixed maximum length and which use the same amount of memory whether this maximum is reached or not, and variable-length strings, whose length is not arbitrarily fixed and which use varying amounts of memory depending on their actual size. Most strings in modern programming languages are variable-length strings. Despite the name, even variable-length strings are limited in length, although, in general, the limit depends only on the amount of memory available. The string length can be stored as a separate integer (which puts a theoretical limit on the length) or implicitly through a termination character, usually a character value with all bits zero.
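Storing the length as a separate integer can be sketched as follows. This is an illustrative layout with a one-byte length field, not the format of any particular language; the function names are assumptions made for the example. The one-byte field is what creates the theoretical limit mentioned above, here 255 bytes.

```python
import struct

def pack_length_prefixed(data: bytes) -> bytes:
    """Store the length in one unsigned byte, followed by the data itself."""
    if len(data) > 255:
        raise ValueError("a one-byte length field cannot describe more than 255 bytes")
    return struct.pack("B", len(data)) + data

def unpack_length_prefixed(buf: bytes) -> bytes:
    """Read the length byte, then return exactly that many bytes of data."""
    (length,) = struct.unpack_from("B", buf, 0)
    return buf[1:1 + length]

packed = pack_length_prefixed(b"abc")
print(list(packed))                    # [3, 97, 98, 99]
print(unpack_length_prefixed(packed))  # b'abc'
```

A wider length field raises the limit at the cost of extra bytes per string.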

Implementations

Some languages, such as C++ and Ruby, normally allow the contents of a string to be changed after it has been created; these are termed mutable strings. In other languages, such as Java and Python, the value is fixed and a new string must be created if any alteration is to be made; these are termed immutable strings.
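Python can show both behaviours in one short sketch: its str type is immutable, while its bytearray type is a mutable sequence of bytes. The variable names are illustrative.

```python
s = "hello"

# Assigning to an element of an immutable string is rejected with TypeError.
try:
    s[0] = "J"
    mutated = True
except TypeError:
    mutated = False

# An alteration therefore has to build a new string; s itself is unchanged.
t = "J" + s[1:]

# A bytearray can be changed in place after it has been created.
buf = bytearray(b"hello")
buf[0] = ord("J")

print(mutated, s, t, buf.decode("ascii"))  # False hello Jello Jello
```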

Strings are typically implemented as arrays of bytes, characters, or code units, in order to allow fast access to individual units or substrings — including characters when they have a fixed length. A few languages such as Haskell implement them as linked lists instead.

C-style strings

The length of a string can be stored implicitly by using a special terminating character. Often this is the null character (NUL), which has all bits zero, a convention used and perpetuated by the popular C programming language. Hence, this representation is commonly referred to as a C string or C-style string.
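The convention can be imitated in Python with a bytes object, as a sketch only. The helpers `to_c_string` and `c_strlen` are hypothetical names; the second one mimics what a length function has to do when the length is not stored explicitly.

```python
def to_c_string(data: bytes) -> bytes:
    """Append the NUL terminator; the payload itself must not contain NUL."""
    if b"\x00" in data:
        raise ValueError("an embedded NUL would end the string early")
    return data + b"\x00"

def c_strlen(buf: bytes) -> int:
    """Count bytes up to, but not including, the first NUL byte."""
    length = 0
    while buf[length] != 0:
        length += 1
    return length

buf = to_c_string(b"hello")
print(len(buf))       # 6: five characters plus the terminator
print(c_strlen(buf))  # 5
```

Finding the length requires scanning the whole string, and a buffer with no terminator makes the scan run past the end (an IndexError in this sketch).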

String processing algorithms

There are many algorithms for processing strings, each with various trade-offs. Some categories of algorithms include:

  • String searching algorithms for finding a given substring or pattern
  • String manipulation algorithms
  • Sorting algorithms
  • Regular expression algorithms
  • Parsing a string

Advanced string algorithms often employ complex mechanisms and data structures, among them suffix trees and finite state machines.
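As an illustration of the first category, here is a minimal sketch of the simplest string searching approach, which tries every starting position in turn. It is the naive method, not one of the advanced algorithms mentioned above, and the function name `find_naive` is chosen for this example.

```python
def find_naive(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    # Try each position where the pattern could still fit inside the text.
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            return i
    return -1

print(find_naive("abcdef", "cde"))  # 2
print(find_naive("abcdef", "xyz"))  # -1
```

The trade-off is simplicity against speed: in the worst case this compares the pattern at every position of the text.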

Source: Wikipedia, "Character" and Wikipedia, "String" (the text has been adapted to the needs of the course)

Dictionary

character: znak
unit of information: jednostka informacji
grapheme: grafem
symbol: symbol
alphabet: alfabet
syllabary: sylabariusz
natural language: język naturalny
letter: litera
numerical digit: cyfra
punctuation mark: znak interpunkcyjny
whitespace: znak biały
control character: znak sterujący
to process: przetwarzać
carriage return: powrót karetki
tab: tabulator
printer: drukarka
device: urządzenie
display: wyświetlacz
character encoding: kodowanie znaków
communication equipment: sprzęt komunikacyjny
to assign: przypisać
to transmit through: przesyłać przez
ASCII: ASCII
UTF-8: UTF-8
Unicode: Unicode
an example of x: przykład x
API: API
character set: zestaw znaków
numerical code: kod liczbowy, kod numeryczny
glyph: glif
the term x is used to describe y: termin x oznacza y
font: font, czcionka
to index: indeksować
char type: typ char
code unit: jednostka kodowa
to imply: implikować
to combine: połączyć, utworzyć kombinację
string: łańcuch
a sequence of characters: sekwencja znaków
literal constant: stała literałowa
the latter: ten drugi
to mutate: zmieniać, modyfikować
array: tablica
to predetermine: określić zawczasu
string literal: literał łańcuchowy
concatenation: konkatenacja, łączenie
substring: podłańcuch
a string s is said to be: łańcuch s nazywa się
prefix: przedrostek, prefiks
if there exists…: jeśli istnieje…
such that: taki, że
nonempty: niepusty
proper prefix: przedrostek właściwy, prefiks właściwy
suffix: przyrostek, sufiks
proper suffix: przyrostek właściwy, sufiks właściwy
string datatype: łańcuchowy typ danych
primitive type: typ prosty
composite type: typ złożony
syntax: składnia
high-level programming language: język programowania wysokiego poziomu
to quote: cytować, ująć w cudzysłów
meta-string: metałańcuch
literal: literał
arbitrary: dowolny, przypadkowy
finite: skończony
the length of x: długość x
constrained: ograniczony
fixed-length string: łańcuch o stałej długości
variable-length string: łańcuch o zmiennej długości
limited in length: o ograniczonej długości
to depend on: zależeć od
implicitly: niejawnie
termination character: znak końcowy
to allow the contents of a string to be changed: pozwalać na zmianę zawartości łańcucha
to term: określić, nazwać
mutable string: zmienny łańcuch
alteration: zmiana, modyfikacja
immutable string: niezmienny łańcuch
array of bytes: tablica bajtów
in order to: aby
often this is: często jest to
null character (NUL): znak pusty, znak null, znak NUL
convention: konwencja
hence: zatem, dlatego, stąd, w związku z tym
C string (C-style string): łańcuch w stylu języka C
trade-off: kompromis
string searching algorithm: algorytm wyszukiwania łańcuchów
pattern: wzorzec
string manipulation algorithm: algorytm przetwarzania łańcuchów
sorting algorithm: algorytm sortowania
regular expression: wyrażenie regularne
parsing: przetwarzanie, parsowanie
data structure: struktura danych
suffix tree: drzewo sufiksowe
finite state machine: skończona maszyna stanów

Exercises

  1. Text datatypes. Translate the sentences into English
  2. Text datatypes. Translate the sentences into Polish
  3. Text datatypes. Fill in the gaps 2
  4. Text datatypes. Fill in the gaps
  5. Text datatypes. Translate the words or expression in brackets into English
  6. Text datatypes. Provide words for the definitions
  7. Text datatypes. Answer the following questions
  8. Text datatypes. Provide English equivalents of these terms and expressions
  9. Text datatypes. Provide Polish equivalents of these terms and expressions

Grammar corner

Present perfect

The present perfect tense can be used to refer to actions or events that have happened repeatedly for some time up to the present. What’s important in such sentences is the connection of the past to the present. Consider the examples below.

Historically, the term character has been widely used by industry professionals to refer to an encoded character, often as defined by the programming language or API. Likewise, character set has been widely used to refer to a specific repertoire of characters that have been mapped to specific bit sequences or numerical codes. The term glyph is used to describe a particular visual appearance of a character.

In the first two sentences the present perfect is used because there’s a connection with the past (historically). However, in the last sentence the present simple is used because there’s no connection with the past.

Find out more about present perfect and present simple.

License: CC-BY-SA 3.0
