Testing whether letters can be scheduled to obtain a word in a regular language

I fix a regular language $L$ on an alphabet $\Sigma$, and I consider the following problem that I call letter scheduling for $L$. Informally, the input gives me $n$ letters and an interval for each letter (i.e., a minimal and a maximal position), and my goal is to place each letter in its interval such that no two letters are mapped to the same position and the resulting $n$-letter word is in $L$. Formally:

Input: $n$ triples $(a_i, l_i, r_i)$ where $a_i \in \Sigma$ and $1 \leq l_i \leq r_i \leq n$ are integers
Output: is there a bijection $f: \{1, \ldots, n\} \to \{1, \ldots, n\}$ such that $l_i \leq f(i) \leq r_i$ for all $i$, and $a_{f^{-1}(1)} \cdots a_{f^{-1}(n)} \in L$?

Obviously this problem is in NP, by guessing a bijection $f$ and checking membership in $L$ in PTIME. My question: is there a regular language $L$ such that the letter scheduling problem for $L$ is NP-hard?
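To make the problem statement concrete, here is a minimal brute-force sketch in Python. It tries every bijection and checks membership with a DFA, so it only illustrates the NP "guess and verify" argument and takes exponential time. The names `run_dfa`, `letter_scheduling_bruteforce`, and the example DFA for $(ab)^*$ are illustrative choices, not part of the question.

```python
from itertools import permutations

def run_dfa(delta, start, accepting, word):
    """Return True if the DFA accepts word; a missing transition rejects."""
    state = start
    for ch in word:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in accepting

def letter_scheduling_bruteforce(triples, accepts):
    """triples: list of (letter, l, r) with 1-based positions.
    accepts: function word -> bool deciding membership in L.
    Returns a word in L that respects all intervals, or None."""
    n = len(triples)
    for perm in permutations(range(n)):
        # perm[i] + 1 plays the role of f(i), the position given to letter i
        if all(l <= perm[i] + 1 <= r for i, (_, l, r) in enumerate(triples)):
            word = [None] * n
            for i, (a, _, _) in enumerate(triples):
                word[perm[i]] = a
            candidate = "".join(word)
            if accepts(candidate):
                return candidate
    return None

# Hypothetical example language: a two-state DFA for (ab)^*
AB_STAR = {(0, "a"): 1, (1, "b"): 0}

def accepts_ab(word):
    return run_dfa(AB_STAR, 0, {0}, word)
```

For instance, with the two letters `("b", 1, 2)` and `("a", 1, 2)` the only way to land in $(ab)^*$ is to put `a` first, while fixing `a` at position 2 makes the instance infeasible.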

Some initial observations:

It seems like similar problems have been studied in scheduling: we could see the problem as scheduling unit-cost tasks on a single machine while respecting start and end dates. However, this latter problem is obviously in PTIME with a greedy approach, and I don't see anything in the scheduling literature for the case where tasks are labeled and we would like to obtain a word in a target regular language.
Another way to see the problem is as a special case of a bipartite maximum matching problem (between letters and positions), but again it is hard to express the constraint that we should fall in $L$.
In the specific case where $L$ is a language of the form $u^*$ for some fixed word $u$ (e.g., $(ab)^*$), the letter scheduling problem for $L$ is in PTIME with an easy greedy algorithm: build the word from left to right, and put at each position one of the available letters that is correct relative to $L$ and has the smallest time $r_i$. (If there are no available letters that are correct, fail.) However, this does not generalize to arbitrary regular languages $L$, because for such languages we may have a choice of which kind of letter to use.
It seems like a dynamic programming algorithm should work, but in fact it's not so simple: it seems like you would need to memorize which set of letters you have taken so far. Indeed, when building a word from left to right, when you have reached a position $i$, your state depends on which letters you have consumed so far. You can't memorize the entire set because then there would be exponentially many states. But it's not so easy to "summarize" it (e.g., by how many copies of each letter have been used), because to know which copies you have used, it seems like you need to remember when you consumed them (the later you consumed them, the more letters were available). Even with a language like $(ab|ba)^*$, there can already be complicated constraints on when you should choose to take $ab$ and when you should choose to take $ba$, depending on which letters you will need later and when the letters become available.
However, as the regular language $L$ is fixed and cannot memorize this much information, I am having a hard time finding an NP-hard problem from which I could reduce.
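The greedy algorithm for $L = u^*$ mentioned in the observations above can be sketched as follows. This is a minimal illustration under the assumption that $u$ is non-empty; the function name `greedy_u_star` and the heap-based bookkeeping are my own choices. Since the target word is forced to be $u^{n/|u|}$, the letter required at each position is known, and among the released copies of that letter we take the one with the smallest $r_i$.

```python
import heapq

def greedy_u_star(triples, u):
    """Greedy for L = u^* (u assumed non-empty).
    triples: list of (letter, l, r) with 1-based positions.
    Returns a list f with f[i] = position of letter i, or None if infeasible."""
    n = len(triples)
    if n % len(u) != 0:
        return None  # no word of length n lies in u^*
    by_start = sorted(range(n), key=lambda i: triples[i][1])
    heaps = {}  # letter -> min-heap of (r, index) for letters already released
    nxt = 0
    f = [None] * n
    for pos in range(1, n + 1):
        # release every letter whose interval has started
        while nxt < n and triples[by_start[nxt]][1] <= pos:
            i = by_start[nxt]
            a, _, r = triples[i]
            heapq.heappush(heaps.setdefault(a, []), (r, i))
            nxt += 1
        need = u[(pos - 1) % len(u)]  # letter forced at this position
        heap = heaps.get(need)
        if not heap:
            return None  # no available copy of the required letter
        r, i = heapq.heappop(heap)
        if r < pos:
            return None  # the most urgent copy already missed its deadline
        f[i] = pos
    return f
```

The sketch makes the limitation visible: the line computing `need` relies on the word being forced, which is exactly what fails for a language such as $(ab|ba)^*$.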

— a3nm

Can you get NP-completeness for some L in PTIME?

— Lance Fortnow

@LanceFortnow Sure. You can pad a 3CNF so that every variable appears in an even number of literals, and every two consecutive occurrences are negated. Encode $x_i$ by $0^i$ and $1^i$, and then in the letter scheduling instance the symbols $(,),\wedge,\vee$ are fixed while the rest are half $0$ and half $1$. In polynomial time, one can check whether the string encodes a padded 3CNF that evaluates to true.

— Willard Zhan

You can also generalize the problem to "arbitrary positions" (not limited to 1..n). Maybe it is easier to prove hardness that way (if it is hard).

— Marzio De Biasi

@MarzioDeBiasi: I'm not sure I understand, do you mean that the position of letters could be an arbitrary subset rather than an interval? I don't know whether this is hard (it starts to look a bit like the exact perfect matching problem), but the version with intervals allows for a greedy algorithm when $L = u^*$, so I'm a bit hopeful that it could be easier.

— a3nm

@a3nm: no, I mean that you can generalize by dropping the constraint $r_i \leq n$; you ask for a word in L in which there is at least one letter $a_i$ in the range $[l_i .. r_i]$; in other words, you don't "build" the full word of length $n$, but you ask for a word of arbitrary length containing the given letters in the allowed ranges. I don't know whether this changes the complexity of the problem, but in this case you have to deal with "indexes" that may not be polynomially bounded by the length of the input.

— Marzio De Biasi

Answers:

The problem is NP-hard for $L = A^*$ where $A$ is the finite language containing the following words:

$x111$, $x000$,
$y100$, $y010$, $y001$,
$00c11$, $01c10$, $10c01$, and $11c00$
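As a small illustration, membership in $A^*$ can be tested with an ordinary regular expression. This is only a sketch for experimenting with the language; the names `A_WORDS` and `in_A_star` are mine.

```python
import re

# The nine words of the finite language A, as listed above
A_WORDS = ["x111", "x000", "y100", "y010", "y001",
           "00c11", "01c10", "10c01", "11c00"]

# A^* is any concatenation of words of A (including the empty word)
A_STAR = re.compile("(?:" + "|".join(A_WORDS) + ")*")

def in_A_star(word):
    """Return True if word is a concatenation of words of A."""
    return A_STAR.fullmatch(word) is not None
```

For example, `in_A_star("x111y010")` holds, while `in_A_star("x110")` does not, because `x` must be followed by three equal bits.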

The reduction is from the problem Graph Orientation, which is known to be NP-hard (see https://link.springer.com/article/10.1007/s00454-017-9884-9 ). In this problem, we are given a 3-regular undirected graph in which every vertex is labeled either "$\{1\}$" or "$\{0,3\}$". The goal is to direct the edges of the graph so that the outdegree of every vertex is in the set labeling that vertex.

The reduction needs to take as input a Graph Orientation instance and produce a list of triples as output. In this reduction, the triples we output will always satisfy certain constraints. These constraints are listed below, and we will refer to a list of triples as valid if and only if it satisfies these constraints:

The characters $x$, $y$, and $c$ are only given intervals containing exactly one index. In other words, whenever these characters are placed, they are placed at specific locations.
For every triple $(i, l, r)$ present in the instance with $i \in \{0,1\}$, the triple $(1-i, l, r)$ is also present.
If $(\alpha, l, r)$ and $(\alpha',l',r')$ are both triples present in the instance, then either $l < l' \le r' < r$, or $l' < l \le r < r'$, or $\{\alpha,\alpha'\} = \{0,1\}$ with $l = l' < r = r'$.
If $(\alpha, l, r)$ is a triple, then the number of triples $(\alpha', l', r')$ with $l \le l' \le r' \le r$ is exactly $r-l+1$.

Note the following lemma, proved at the end of this post.

Lemma: for a valid list of triples, the characters $x$, $y$, and $c$ must be placed exactly as indicated by the triples, and for any pair of triples $(0, l, r)$ and $(1, l, r)$, the two characters for that triple must be placed at indices $l$ and $r$.

The idea of the reduction is then the following.

We use pairs of triples $(0, l, r)$ and $(1, l, r)$ to represent edges. The edge goes between endpoints at index $l$ and at index $r$. Assuming we produce a valid list of triples, the characters from these two triples must be placed at $l$ and $r$, so we can treat the order in which they are placed as indicating the direction of the edge. Here $1$ is the "head" of the edge and $0$ is the "tail". In other words, if the $1$ is placed at $r$ then the edge points from $l$ to $r$, and if the $1$ is placed at $l$ then the edge points from $r$ to $l$.

To represent vertices, we place either an $x$ or a $y$ character at an index and use the next three characters as endpoints of the three edges which touch the vertex. Note that if we place an $x$, all three edges at the vertex must point in the same direction (all into the vertex or all out of the vertex) simply due to the strings that are in the finite language $A$. Such vertices have outdegree $0$ or $3$, so we place an $x$ exactly for the vertices labeled $\{0,3\}$. If we place a $y$, exactly one of the three edges at the vertex must point in a different direction from the other two, due to the strings in $A$. Such vertices have outdegree $1$, so we place a $y$ exactly for the vertices labeled $\{1\}$.

In some sense, we are done. In particular, the correspondence between solving this instance and solving the Graph Orientation instance should be clear. Unfortunately, the list of triples we produce may not be valid, and so the "edges" described may not work as intended. In particular, the list of triples might not be valid because the condition that the intervals from the triples must always contain each other might not hold: the intervals from two edges may overlap without one containing the other.

To combat this, we add some extra infrastructure. In particular, we add "crossover vertices". A crossover vertex is a vertex of degree $4$ whose edges are paired such that within each pair one edge must point into the crossover vertex and one out. In other words, a crossover vertex will behave the same as two "crossing" edges. We represent a crossover vertex by placing the character $c$ at some index $i$. Then note that the language $A$ constrains the characters at $i-1$ and $i+2$ to be opposite (one $0$ and one $1$) and the characters at $i-2$ and $i+1$ to be opposite. Thus, if we use these indices as the endpoints for the four edges at the crossover vertex, the behavior is exactly as described: the four edges are in pairs, and among every pair one points in and one points out.
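The pairing property of the crossover words can be checked mechanically. The following sketch enumerates every five-letter word of the shape "two bits, $c$, two bits" and confirms that the ones with opposite characters at $i-1$, $i+2$ and at $i-2$, $i+1$ are exactly the four $c$-words of $A$. The helper name `pairs_opposite` is illustrative.

```python
from itertools import product

# The four words of A that contain the crossover character c
C_WORDS = {"00c11", "01c10", "10c01", "11c00"}

def pairs_opposite(w):
    """Check that positions i-1, i+2 differ and positions i-2, i+1 differ,
    where i is the index of c in w."""
    i = w.index("c")
    return w[i - 1] != w[i + 2] and w[i - 2] != w[i + 1]

# All 16 words of the form (bit)(bit)c(bit)(bit)
all_candidates = {a + b + "c" + d + e for a, b, d, e in product("01", repeat=4)}

# Those satisfying the pairing constraint coincide with the c-words of A
satisfying = {w for w in all_candidates if pairs_opposite(w)}
```

Here `satisfying == C_WORDS`, so $A$ imposes the pairing constraint and nothing more on the four endpoints around a $c$.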

How do we actually place these crossovers? Well suppose we have two intervals $(l, r)$ and $(l', r')$ which overlap. WLOG, $l < l' < r < r'$ . We add the crossover character into the middle (between $l'$ and $r$ ). (Let's say that all along we spaced everything out so far that there's always enough space, and at the end we will remove any unused space.) Let the index of the crossover character be $i$ . Then we replace the four triples $(0, l, r)$ , $(1, l, r)$ , $(0, l', r')$ , and $(1, l', r')$ with eight triples with two each (one with character $0$ and one with character $1$ ) for the following four intervals $(l, i-1)$ , $(i+2, r)$ , $(l', i-2)$ , $(i+1, r')$ . Notice that the intervals don't overlap in the bad way anymore! (After this change, if two intervals overlap, one is strictly inside the other.) Furthermore, the edge from $l$ to $r$ is replaced by an edge from $l$ to the crossover vertex followed by the edge from there to $r$ ; these two edges are paired at the crossover vertex in such a way that one is pointed in and one is pointed out; in other words, the two edges together behave just like the one edge they are replacing.
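The replacement step described above can be sketched as a small function. It assumes, as the text does, that there is enough free space around the chosen index $i$; the names `uncross` and `crosses` are hypothetical, and the sketch also emits the triple fixing $c$ at index $i$ alongside the eight edge triples.

```python
def crosses(p, q):
    """True if intervals p and q overlap without one containing the other."""
    (a, b), (c, d) = p, q
    return a < c <= b < d or c < a <= d < b

def uncross(l, r, l2, r2, i):
    """Replace the crossing edge intervals (l, r) and (l2, r2), where
    l < l2 < r < r2, by a crossover vertex whose character c sits at index i.
    Returns the c triple followed by the eight new 0/1 triples."""
    assert l < l2 < r < r2
    # assumption: i lies between l2 and r with room for the four endpoints
    assert l2 < i - 2 and i + 2 < r
    new_intervals = [(l, i - 1), (i + 2, r), (l2, i - 2), (i + 1, r2)]
    triples = [("c", i, i)]
    for a, b in new_intervals:
        triples += [("0", a, b), ("1", a, b)]
    return triples
```

With `l, r, l2, r2, i = 1, 20, 5, 30, 10`, the original intervals cross, while the four new intervals $(1, 9)$, $(12, 20)$, $(5, 8)$, $(11, 30)$ are pairwise nested or disjoint.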

In some sense, putting in this crossover vertex "uncrossed" two edges (whose intervals were overlapping). It is easy to see that adding the crossover vertex can't cause any additional edges to become crossed. Thus, we can uncross every pair of crossing edges by inserting enough crossover vertices. The end result still corresponds to the Graph Orientation instance, but now the list of triples is valid (the properties are all easy to verify now that we have "uncrossed" any crossing edges), so the lemma applies, the edges must behave as described, and the correspondence is actually an equivalence. In other words, this reduction is correct.

proof of lemma

Lemma: for a valid list of triples, the characters $x$ , $y$ , and $c$ must be placed exactly as indicated by the triples, and for any pair of triples $(0, l, r)$ and $(1, l, r)$ , the two characters for that triple must be placed at indices $l$ and $r$ .

proof:

We proceed by induction on the triples by interval length. In particular, our statement is the following: for any $k$ if some triple has interval length $k$ then the character in that triple must be placed as described in the lemma.

Base case: for $k = 1$, the triple must be placing a character $x$, $y$, or $c$ at the single index inside the interval. This is exactly as described in the lemma.

Inductive case: assume the statement holds for any $k$ less than some $k'$. Now consider some triple with interval length $k'$. Then that triple must be of the form $(i, l, r)$ with $r = l+k'-1$ and $i \in \{0,1\}$. The triple $(1-i, l, r)$ must also be present. The number of triples $(\alpha',l',r')$ with $l \le l'\le r' \le r$ is exactly $r-l+1 = k'$. These triples include the triples $(0, l, r)$ and $(1, l, r)$ but also $k'-2$ other triples of the form $(\alpha',l',r')$ with $l < l' \le r' < r$. These other triples all have interval length smaller than $k'$, so they all must place their characters as specified in the lemma. The only way for this to occur is if these triples place characters in every index starting at index $l+1$ and ending at index $r-1$. Thus, our two triples $(0, l, r)$ and $(1, l, r)$ must place their characters at indices $l$ and $r$, as described in the lemma, concluding the inductive case.

By induction, the lemma is correct.

— Mikhail Rudoy

Thanks a lot for this elaborate proof, and with a very simple language! I think it is correct, the only thing I'm not sure about is the claim that "adding the crossover vertex can't cause any additional edges to become crossed". Couldn't it be the case that the interval $(l, r)$ included some other interval $(l'', r'')$ with $l \leq l'' \leq r'' \leq r$, and now one of $(l, i-1)$ and $(i+2, r)$ crosses it? It seems like the process still has to converge because the intervals get smaller, but that's not completely clear either because of the insertion of crossover vertices. How should I see it?

— a3nm

If $l < l' < r < r'$, then you can insert the new indices for the new crossover vertex immediately to the right of $l'$. This causes the new indices ($i\pm$ a bit) to be in exactly those intervals that used to contain $l'$. It should be easy to see that adding a crossover vertex can add a new crossing with some other interval only if the new indices fall in the other interval. If $l' < l'' < r'' < r'$ then the new indices do not fall into the interval $(l'', r'')$. If $l < l'' < r'' < r$ then the new indices might fall into the interval $(l'', r'')$, but only if $l'$ already fell into that

— Mikhail Rudoy

(continued) interval. In this case, you aren't actually creating a new crossing, just turning an old crossing with the old interval $(l,r)$ into a new crossing with the interval $(i+\text{something}, r)$.

— Mikhail Rudoy

I guess in your second message you meant "with the old interval $(l', r')$" rather than "$(l, r)$"? But OK, I see it: when you add the crossing vertex, the only bad case would be an interval $I$ that overlaps with a new interval without overlapping with the corresponding interval. This cannot happen for supersets of $(l, r)$ or of $(l', r')$: if they overlap with a new interval then they overlapped with the old one. Likewise for subsets of $(l, r)$ or $(l', r')$, for the reason that you explain. So I agree that this proof looks correct to me. Thanks again!

— a3nm

@MikhailRudoy was the first to show NP-hardness, but Louis and I had a different idea, which I thought I could outline here since it works somewhat differently. We reduce directly from CNF-SAT, the Boolean satisfiability problem for CNFs. In exchange for this, the regular language $L$ that we use is more complicated.

The key to show hardness is to design a language $L'$ that allows us to guess a word and repeat it multiple times. Specifically, for any number $k$ of variables and number $m$ of clauses, we will build intervals that ensure that all words $w$ of $L'$ that we can form must start with an arbitrary word $u$ of length $k$ on alphabet $\{0, 1\}$ (intuitively encoding a guess of the valuation of variables), and then this word $u$ is repeated $m$ times (which we will later use to test that each clause is satisfied by the guessed valuation).

To achieve this, we will fix the alphabet $A = \{0, 1, \#, 0', 1'\}$ and the language: $L' := (0|1)^* (\# (00'|11')^*)^* \# (0|1)^*$ . The formal claim is a bit more complicated:

Claim: For any numbers $k, m \in \mathbb{N}$ , we can build in PTIME a set of intervals such that the words in $L'$ that can be formed with these intervals are precisely:

$\left\{ u (\# (\tilde{u} \bowtie \tilde{u}') \# (u \bowtie u'))^m \# \tilde{u} \mid u \in \{0, 1\}^k\right\}$

where $\tilde{u}$ denotes the result of reversing the order of $u$ and swapping $0$'s and $1$'s, where $u'$ denotes the result of adding a prime to all letters in $u$, and where $x \bowtie y$ for two words $x$ and $y$ of length $p$ is the word of length $2p$ formed by taking alternately one letter from $x$ and one letter from $y$.
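To make the notation concrete, here is a small sketch that builds the word from the claim for a given $u$ and $m$ and checks it against $L'$. As an encoding assumption of this sketch, a primed letter such as $0'$ is written as the two characters `0'`, so that $L'$ can be tested with a plain regular expression on strings; the function names are illustrative.

```python
import re

# L' = (0|1)^* (# (00'|11')^*)^* # (0|1)^*, with 0' written as the string "0'"
L_PRIME = re.compile(r"(?:0|1)*(?:#(?:00'|11')*)*#(?:0|1)*")

def tilde(u):
    """Reverse u and swap 0s and 1s."""
    return "".join("1" if ch == "0" else "0" for ch in reversed(u))

def bowtie_with_primes(u):
    """u bowtie u': each letter of u followed by its primed copy."""
    return "".join(ch + ch + "'" for ch in u)

def claimed_word(u, m):
    """The word u (# (~u bowtie ~u') # (u bowtie u'))^m # ~u from the claim."""
    block = "#" + bowtie_with_primes(tilde(u)) + "#" + bowtie_with_primes(u)
    return u + block * m + "#" + tilde(u)

def in_L_prime(word):
    return L_PRIME.fullmatch(word) is not None
```

For $u = 011$ we get $\tilde{u} = 001$, and `claimed_word("011", 1)` is `011#00'00'11'#00'11'11'#001`, which matches $L'$.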

Here's an intuitive explanation of the construction that we use to prove this. We start with intervals that encode the initial guess of $u$. Here is the gadget for $k = 4$ (left), and a possible solution (right):

It's easy to show the following observation (ignoring $L'$ for now): the possible words that we can form with these intervals are exactly $u \# \tilde{u}$ for $u \in \{0, 1\}^k$ . This is shown essentially like the Lemma in @MikhailRudoy's answer, by induction from the shortest intervals to the longest ones: the center position must contain $\#$ , the two neighboring positions must contain one $0$ and one $1$ , etc.

We have seen how to make a guess, now let's see how to duplicate it. For this, we will rely on $L'$ , and add more intervals. Here's an illustration for $k = 3$ :

For now take $L := (0|1)^* (\# (00'|11')^*)^* \# (0' | 1')^*$ . Observe how, past the first $\#$ , we must enumerate alternatively an unprimed and a primed letter. So, on the un-dashed triangle of intervals, our observation above still stands: even though it seems like these intervals have more space to the right of the first $\#$ , only one position out of two can be used. The same claim holds for the dashed intervals. Now, $L$ further enforces that, when we enumerate an unprimed letter, the primed letter that follows must be the same. So it is easy to see that the possible words are exactly: $u \# (\tilde{u} \bowtie \tilde{u}') \# u'$ for $u \in \{0, 1\}^k$ .

Now, to show the claim, we simply repeat this construction $m$ times. Here's an example for $k=3$ and $m=2$ , using now the real definition of $L'$ above the statement of the claim:

As before, we could show (by induction on $m$ ) that the possible words are exactly the following: $u (\# \tilde{u} \bowtie \tilde{u}' \# u \bowtie u')^2 \# \tilde{u}$ for $u \in \{0, 1\}^k$ . So this construction achieves what was promised by the claim.

Thanks to the claim we know that we can encode a guess of a valuation for the variables, and repeat the valuation multiple times. The only missing thing is to explain how to check that the valuation satisfies the formula. We will do this by checking one clause per occurrence of $u$ . To do this, we observe that without loss of generality we can assume that each letter of the word is annotated by some symbol provided as input. (More formally: we could assume that in the problem we also provide as input a word $w$ of length $n$ , and we ask whether the intervals can form a word $u$ such that $w \bowtie u$ is in $L$ .) The reason why we can assume this is because we can double the size of each interval, and add unit intervals (at the bottom of the picture) at odd positions to carry the annotation of the corresponding even position:

Thanks to this observation, to check clauses, we will define our regular language $L$ to be the intersection of two languages. The first language enforces that the sub-word on even positions is a word in $L'$, i.e., if we ignore the annotations then the word must be in $L'$, so we can just use the construction of the claim and add some annotations. The second language $L''$ will check that the clauses are satisfied. To do this, we will add three letters to our alphabet, to be used as annotations: $+$, $-$, and $\epsilon$. At clause $1 \leq i \leq m$, we add unit intervals to annotate by $+$ the positions in the $i$-th repetition of $u$ corresponding to variables occurring positively in clause $i$, and annotate by $-$ the positions corresponding to negatively occurring variables. We annotate everything else by $\epsilon$. It is now clear that $L''$ can check that the guessed valuation satisfies the formula, by verifying that, between each pair of consecutive $\#$ symbols that contain an occurrence of $u$ (i.e., one pair out of two), there is some literal that satisfies the clause, i.e., there must be one occurrence of the subword $+1$ or of the subword $-0$.

This concludes the reduction from CNF-SAT and shows NP-hardness of the letter scheduling problem for the language $L$ .

— a3nm